Richard J. Feldmann
Division of Computer Research and Technology
National Institutes of Health, Bethesda, Maryland
George W. A. Milne
National Heart and Lung Institute
National Institutes of Health, Bethesda, Maryland
and
Stephen R. Heller
Management Information Data Systems Division
Environmental Protection Agency, Bethesda, Maryland
Introduction
The emergence of crystallographic data files (the Cambridge
Crystal file of small organic structures and the Brookhaven
National Laboratory file of protein structures) has led to the
development of systems of programs for search, retrieval and
display of various aspects of this data.
This paper explores the implications of this trend in the uses of
crystallographic information in an attempt to show what types of
information systems will be needed and how these information
systems interact with the basic process of structure
determination.
The main characteristics of the future crystallographic information system will most probably be universal accessibility and integration with other chemical and biochemical data bases. The short term problems of program development and implementation on a computer network are discussed in relation to the long term problems of integrating crystallographic data bases with other data bases by the use of the Chemical Abstracts Registry identifiers.
1. Changes in Use of Structure Solution Techniques
Clearly crystallographers will continue to investigate the
improved methods of structure solution and refinement. The
existence, however, of a data base of all previously solved
structures makes it possible to compare present and past
structure solutions. The existence of fully developed structure
solving computer program packages means that scientists in other
disciplines can begin to apply crystallographic structure solving
techniques to their problems. Care must be taken to make the
limitations of the techniques known to these users. This can be accomplished by collaboration with crystallographers. Concomitant
with the shift in primary discipline of these new users of single
crystal structure solving techniques there is a shift in the type
of structures being solved. The shift is from one-time solution
of difficult structures to solution of relatively simple
structures as part of molecular assays. Crystallographers have
used structures of escalating difficulty as vehicles for
improving structure solving techniques. Now that the techniques
can handle structures with a wide range of molecular weights with
little difficulty, the non-crystallographer may consider applying
the techniques to solving structures within this class. Since
structures solved as part of a molecular assay contribute little
or nothing toward the improvement of structure solving techniques
we can expect a shift from the publication of structure solutions
to the determination of the similarity between a newly solved
structure and previously solved structures; or the entering of
structure solutions into the universal structure data base.
The size (3.5 million structures) and class composition of the
Chemical Abstracts Service (CAS) Structure Registry file
indicate that there are many simple structures within reach of
standard structure solving techniques.
As the general chemist, biochemist or physicist begins to look at
collections of structures, a strong pressure develops to shift
from manual structure solving to automated solving of simple
structures.
Automation in this case is not the exclusion of human insight and
decision-making capability from the structure solving process.
Rather, it is the augmentation of human intellect by relegating,
once and for all, to the computer those portions of the structure
solving process which are purely routine. Automated structure
solution implies a shift from isolated structure solving
(computer controlled diffractometer, structure solving package,
journals and models) to integrated structure solving (computer
controlled diffractometer, structure solving package integrated
with computer bibliographic and structure search, and computer
modeling of solutions).
Automation is accomplished by integrating the procedures of each
step in the structure solving process to avoid computer to non-computer and non-computer to computer interfaces, as currently
exist in the map calculation to map generation step and in the
map coordinates to structure manipulation (drawing) step.
Taken in themselves the changes going on in this type of
crystallography actually constitute only a change in extent. At the
same time, however, we are witnessing a very rapid evolution in
computer hardware, as explained in the next section.
2. Hardware Developments
The Large Scale Integration (LSI) of computer components is significantly affecting the cost of processors. This means that the cost of a given computational capability can be decreased, or that for constant cost the computational capability can be increased. Features which were formerly only possible in large computing systems are currently being incorporated into terminal devices. In the past a typical terminal device was a keyboard, a character printer and a telephone coupler. Such devices had no computational capability and essentially no graphic capability. The terminal device of the near future will have the following characteristics:
1. 128K to 256K bytes of memory
2. Microprogrammed central processing unit (CPU)
3. Interpreter for a high level language (BASIC, ALGOL or APL)
4. Line drawing graphics
5. Tape cassette or disk units
6. Communications interface
7. Instrumentation interface.
This level of capability can presently be obtained for under $10,000
and should drop to under $5,000 within three years if past hardware
prices are any indication of what will happen in the future. These
terminal devices have all of the characteristics of a complete
computing environment. Individual minicomputers in the past have had
some of these characteristics. The microprogrammed interpreter for a
high level language is beginning to appear in the new terminal
devices. This means that the high level language is the "machine-language".
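The price expectation stated above (under $10,000 now, under $5,000 within three years) can be turned into an implied yearly rate of decline. The following is only a rough sketch: the two prices come from the text, while the assumption of a constant yearly rate and the function names are introduced here for illustration.

```python
# Sketch: implied yearly price decline for a terminal device.
# The prices ($10,000 falling to $5,000 in three years) are from the text;
# the constant yearly rate is an assumption made for illustration.


def annual_price_factor(start_price, end_price, years):
    """Return the constant yearly multiplier taking start_price to end_price."""
    return (end_price / start_price) ** (1.0 / years)


def projected_price(start_price, factor, years):
    """Return the price after `years` at a constant yearly multiplier."""
    return start_price * factor ** years


factor = annual_price_factor(10_000, 5_000, 3)
yearly_decline_percent = (1 - factor) * 100

print(f"yearly multiplier: {factor:.4f}")
print(f"yearly decline:    {yearly_decline_percent:.1f} percent")
```

Under this assumption a halving in three years corresponds to a price drop of roughly one fifth per year.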
The typical use of such a terminal device can be
1. Bibliographic and structure search
2. Control of or interaction with a diffractometer
3. Crystallographic data processing
4. Structure manipulation and display.
3. Information System Developments
As the LSI terminal devices proliferate both in number and in type in the next few years there will be a tendency to program each type of device as a special case. History is replete with parochial user groups which have sprung up around particular types of equipment. There is a need for a centralized source of crystallographic programs and structure data.
A centralized source of program and structure data has the following
characteristics:
1. It reduces the chance of duplication of effort
2. It provides a new and faster facility for propagating information
Journals in some ways act as a central source of crystallographic
programs and structure data. Any data found in a journal, however,
must be re-entered into computer form. The most effective central
source for computer related information is access to one or more
computers via a network.
A network provides
1. Communication over long distances
2. Small computations on data bases (search and display)
3. Access to bulk computation
A computer network is in effect a utility for data transmission and
computation. Networks come in three types of configurations
1. Point to point
2. Star
3. Symmetric
A point to point network can be developed by using the dial-up
telephone system or by leasing permanent lines. Long distance rates
are structured so that all distances greater than 600 miles
effectively have the same rate. Point to point data transmission is
very flexible. It allows scientists to make experiments in the use
of computer based systems without the need for extensive logistics
planning.
The star network brings all users together in each nodal city and
then carries all the user messages to the central computer system
over high speed telephone lines. There are several commercial star
networks in existence. The General Electric network for example has
nodes in all major North American cities, in the major cities of
Western Europe, and in several cities in Japan, Brazil and
Australia.
The symmetric network couples several independent computing systems,
transfers data between computers and brings the users to any of the
computing systems. There are several symmetric networks in
operation. TYMNET is a commercial symmetric network. Users enter
the network at nodal cities. Various institutions have attached
their computers to the network. The National Library of Medicine
operates MEDLINE over the TYMNET.
Steps which have already been taken in the process of experimenting
with the network use of crystallographic programs are:
1. Central program sources.
2. Point to point use of structure search and display systems
3. Implementation of data bases on a star network
4. Remote use of bulk computation.
The CRYSTNET project (1) has been concerned with the development of a display system for point to point use of crystallographic programs. The design of the display system is centered around a DEC PDP 11/45 computer and a Vector General display. These display systems interact over dial-up telephone lines with programs running on the Brookhaven CDC-6600. The rate structure of the large computers at the National Laboratories makes them in effect
sources of bulk computation. Another aspect of the CRYSTNET project
has been concerned with the standardization of crystallographic data
processing programs. One of the uses of the standardized programs
and the display system at each of the three CRYSTNET installations
has been to prepare the protein structure file which is maintained
at Brookhaven (2).
Experimentation has been done at the National Institutes of Health
(NIH) on the use of the dial-up telephone system for accessing
the programs which search for and display molecular structures (3)
from the Cambridge Crystal Data base (4). The response by
crystallographers, biochemists and chemists to the contents and
style of use of the data base has been very good. More than twenty
copies of the search and display programs and the data base have
been exported to various institutions. Similar experimentation was
done earlier at the NIH (5) with the Aldermaston Mass Spectrum data
base (6). The strong user response led to the implementation of that
data base, in collaboration with the Environmental Protection Agency
(EPA), on the General Electric computer network and, more recently,
on the ADP-Cyphernetics computer network. In the two years of
operation the mass spectrum
retrieval system has attracted 150 world-wide users (7).
4. Problems in System Development
Early experience with the development of display systems and the
implementation and use of crystallographic programs and data bases
on computer networks has led to the recognition of a number of
problem areas. These problem areas are
1. Evolution of operating systems and strategies for the future
generations of terminal devices
2. Support of program collection, standardization and maintenance
3. Support of data bases on a network
4. Integration of crystallographic data bases with other chemical
and biochemical data bases
5. Solidification of the role of bulk computation systems
When the terminal device has only a character printer the user can
connect to the network, do a search, disconnect from the network and
look carefully at the results. The hard copy from the printer
provides the medium for information review. If the terminal device
has a graphics display and even a graphics hard copy unit, the user
can review graphical details at leisure. The incorporation of a CPU,
a storage device (a cassette or disk) and memory into terminal
devices opens up new possibilities for network use. The user can
follow the same connect-search-disconnect strategy but now the
search is recorded on the cassette or disk. The CPU in the terminal
device can then be used to re-search the retrieved results. For
example, a user could connect to the network, formulate a search
which would retrieve a suitably large or small sub-class of
compounds, retrieve the sub-class and then disconnect from the
network. If the terminal device has a program which can search the
retrieved structures, the user can retrieve more specific
structures. The structures thus retrieved from the local cassette or
disk can then be displayed and manipulated without further network
costs. The interactive nature of the program running in the terminal
device should be the same as the style of the program running on the
network. This program should be obtained from the central program
source by connecting to the network and issuing a command which
transfers the program from the network repository computer to the
terminal device.
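The connect-search-disconnect strategy with a local re-search can be sketched as follows. This is a minimal illustration only: the record layout (`id`, `elements`, `n_atoms`), the file name and the function names are hypothetical, and a JSON file stands in for the cassette or disk of the terminal device.

```python
# Sketch of the connect-search-disconnect strategy with a local re-search.
# Record fields, file name and function names are hypothetical.
import json
import os
import tempfile


def save_retrieved(records, path):
    """Record a broad network search result on local storage."""
    with open(path, "w") as handle:
        json.dump(records, handle)


def search_local(path, predicate):
    """Re-search the locally stored records without using the network."""
    with open(path) as handle:
        records = json.load(handle)
    return [record for record in records if predicate(record)]


# Result of one broad search made while connected (placeholder records).
broad_result = [
    {"id": "S1", "elements": ["C", "H", "O"], "n_atoms": 9},
    {"id": "S2", "elements": ["C", "H", "N", "O"], "n_atoms": 14},
    {"id": "S3", "elements": ["C", "H", "N"], "n_atoms": 30},
]

with tempfile.TemporaryDirectory() as workdir:
    local_file = os.path.join(workdir, "retrieved.json")
    save_retrieved(broad_result, local_file)  # done once, while connected
    # After disconnecting: a more specific search at no further network cost.
    narrowed = search_local(
        local_file,
        lambda record: "N" in record["elements"] and record["n_atoms"] < 20,
    )

narrowed_ids = [record["id"] for record in narrowed]
print(narrowed_ids)
```

The point of the design is that only the first, broad retrieval incurs network charges; every later refinement runs against the local copy.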
The integrated terminal device can also be used for the initiation
of batch jobs on a bulk computer attached to the network. Again the
existence of a CPU, storage and memory in the terminal device opens
up new possibilities. The terminal device can be used to accept bulk
computation results from the network. The user can then review the
bulk computation results at leisure and reprocess selected
information in graphical terms. For example, the terminal device may
be controlling the diffractometer, or even communicating with an
even smaller CPU which in turn is controlling the diffractometer.
Data thus collected is shipped up the network to the bulk computer.
Data processing occurs and the results are shipped down the network
to the terminal device. The resulting maps (Patterson or electron
density) can be drawn on the terminal device. If programs are
available from the network program source, the map sections can be
drawn in stereo and the molecule fitted to the electron density. This
achieves a very nice balance between the tasks which require a high
degree of interaction (diffractometer data collecting and molecular
graphics) and the tasks which require either a powerful CPU or
extensive amounts of memory or storage.
Because the CPUs of terminal devices are microprogrammed to
interpret one or more high level languages, the central repository
will only have to hold programs as high level source statements. This
will considerably simplify the problem of making programs run on a
number of different types of terminal devices. In the past when
machine language was the lowest level for programming, there was
always the possibility of descending from the high level language to
machine language to achieve some particular effect. Microprogramming
of a high level interpreter imposes other constraints. It seems
fairly clear that FORTRAN will not and perhaps should not be
interpreted by such microprogrammed CPUs. A more strongly structured language such as ALGOL, PL/1 or APL is easier to
microprogram. As a central program source is developed,
consideration should be given to choice of program language since
FORTRAN programs are not easy to maintain. The CRYSTNET experience
supports this feeling.
The development of a central repository of crystallographic programs
not associated with a data base is conditioned among other things by
the problem of finding support for such an activity. National or
international support of program maintenance is very difficult.
Network surcharging offers one possible approach to supporting
program maintenance. Each time a program source is copied from the
network repository, a charge is levied against the user's network
account. This money goes to the organization or individual
designated as the maintainer of that particular program. The network
provides the medium for collecting the increments of user generated
support.
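The surcharging mechanism described above amounts to simple bookkeeping, which might be sketched as follows. The class name, program name, maintainer name and the surcharge amount are all hypothetical; nothing here describes an actual network accounting system.

```python
# Sketch of network surcharging for program maintenance.
# All names and amounts are hypothetical illustrations.
from collections import defaultdict


class ProgramRepository:
    """Central program source that levies a surcharge on each copy."""

    def __init__(self):
        self._programs = {}  # name -> (source, maintainer, surcharge)
        self.user_charges = defaultdict(float)
        self.maintainer_credits = defaultdict(float)

    def register(self, name, source, maintainer, surcharge):
        """Deposit a program and name the maintainer to be supported."""
        self._programs[name] = (source, maintainer, surcharge)

    def copy_source(self, name, user):
        """Charge the user's account, credit the maintainer, return the source."""
        source, maintainer, surcharge = self._programs[name]
        self.user_charges[user] += surcharge
        self.maintainer_credits[maintainer] += surcharge
        return source


repo = ProgramRepository()
repo.register("fourier_map", "10 REM MAP PROGRAM", "lab_a", surcharge=2.0)

copy_one = repo.copy_source("fourier_map", user="user_1")
copy_two = repo.copy_source("fourier_map", user="user_2")

print(dict(repo.user_charges))
print(dict(repo.maintainer_credits))
```

Each copy adds a small increment to the maintainer's credit, so support accumulates in proportion to how widely a program is actually used.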
Because of the difference in size, the problems affecting data base
maintenance are far greater than those affecting program source
maintenance. Data bases are typically one to four orders of
magnitude larger than program sources. Whereas the disk space used
in storing program sources on a machine on a network is and will
probably continue to be negligible, the disk space used by any
important data base is very large. The Cambridge crystal file, as an
example, uses 1.3 x 10^7 bytes of storage and costs $500 per month to
store on the ADP-Cyphernetics network. The NIH/EPA experience with
the Aldermaston Mass Spectrum data base indicates that the user
subscription is a feasible mechanism for supporting a data base on a
commercial network. When a commercial star network is used as the
medium of access, the data base manager must either recover storage
charges from users or seek national or international subsidy. The
commercial symmetric network, however, permits an individual
institution to support the data base on its machine and allows that
institution the right to permit access by other network customers.
The symmetric network permits a broad range of institutions to enter
into the business of supporting scientific data bases as adjuncts of
scientific enterprises.
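The storage figures quoted above for the Cambridge crystal file allow a rough estimate of what a subscription scheme has to recover. In the sketch below the file size and monthly charge are from the text, whereas the subscription fee of $25 per user per month is a purely hypothetical value chosen for illustration.

```python
# Sketch: recovering data base storage charges through user subscriptions.
# File size and monthly charge are from the text; the subscription fee
# is a hypothetical value.
import math

FILE_BYTES = 1.3e7            # size of the Cambridge crystal file
MONTHLY_STORAGE_COST = 500.0  # dollars per month on the commercial network


def cost_per_megabyte(monthly_cost, size_bytes):
    """Monthly storage charge per 10^6 bytes."""
    return monthly_cost / (size_bytes / 1e6)


def subscribers_needed(monthly_cost, monthly_fee_per_user):
    """Smallest number of subscribers whose fees cover the storage charge."""
    return math.ceil(monthly_cost / monthly_fee_per_user)


per_megabyte = cost_per_megabyte(MONTHLY_STORAGE_COST, FILE_BYTES)
annual_cost = 12 * MONTHLY_STORAGE_COST
break_even_users = subscribers_needed(MONTHLY_STORAGE_COST, 25.0)

print(f"storage charge: ${per_megabyte:.2f} per 10^6 bytes per month")
print(f"annual charge:  ${annual_cost:.0f}")
print(f"subscribers needed at $25 per month: {break_even_users}")
```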
As more data bases come into existence or are made accessible
through computer networks, it becomes important to integrate the
information contained in the individual data bases. The Chemical
Abstracts Service (CAS) has been registering compounds for the last
ten years (8). CAS now has a data base of approximately 3.5
million compounds. The CAS registry identifier acts as the connection between different data bases. The NIH and the EPA are in the
process of obtaining CAS registry identifiers for the mass spectral,
nuclear magnetic resonance and crystallographic data bases. These
data bases, with the CAS-provided chemical connectivity and chemical
nomenclature, should form the basis for the accretion of other
chemical data bases. These integrated data bases will be available
on a commercial star network (ADP-Cyphernetics) in the near future.
The CAS registry identifier can serve as a link to searching on
TYMNET for references in the NLM TOXLINE and in the Lockheed data
base implementation of Chemical Abstracts Condensates.
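The role of the registry identifier as the connection between data bases can be illustrated with a small sketch. The identifiers (`RN-A`, `RN-B`), the entry labels and the function names are placeholders invented for this example; they are not real CAS registry identifiers or real data base entries.

```python
# Sketch: linking separate data bases through a shared registry identifier.
# Identifiers and entries are placeholders, not real registry data.


def link_by_registry_id(**data_bases):
    """Merge several {registry_id: entry} tables into one linked table."""
    linked = {}
    for name, records in data_bases.items():
        for registry_id, entry in records.items():
            linked.setdefault(registry_id, {})[name] = entry
    return linked


def compounds_in_all(linked, names):
    """Registry identifiers that occur in every one of the named data bases."""
    return sorted(
        registry_id
        for registry_id, entries in linked.items()
        if all(name in entries for name in names)
    )


mass_spectra = {"RN-A": "mass spectrum 17", "RN-B": "mass spectrum 42"}
crystal = {"RN-A": "crystal structure 5"}
nmr = {"RN-B": "NMR spectrum 8"}

linked = link_by_registry_id(mass_spectra=mass_spectra, crystal=crystal, nmr=nmr)
both = compounds_in_all(linked, ["mass_spectra", "crystal"])

print(linked["RN-A"])
print(both)
```

Because every data base is keyed by the same identifier, a new data base can be added to the linked table without changing the existing ones.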
The last of the problem areas to be discussed is that of solidifying
the role of bulk computer systems. There are substantial differences
in the philosophies pertaining in the United States on one hand and
many European countries on the other. The ready availability of
extremely inexpensive bulk time on subsidized machines in European
laboratories presents star networks with a more serious barrier to
acceptance than they experience in the United States. The safest
strategy assumes that there will always be sources of bulk
computation independent of terminal power. The National Resource for
Computation in Chemistry (NRCC) (9) is the current example of a bulk
source trying to come into existence. If the NRCC exists on a
network, crystallographers will have controlled access to large
blocks of computation. It is not clear whether NRCC should be
concerned with program maintenance because the funding for program
maintenance would get confused with the funding for bulk
computation. Bulk computation can be considered as one example of
special purpose computation. In the next few years special purpose
processors will be readily constructed from LSI components. Wilson
(10) is currently building a processor which will simulate the
Newtonian forces between atoms in a large molecule. Barry at
Washington University in St. Louis is building a processor which
rapidly searches for conformations of a molecule as specified by NMR
data. Inevitably other processors, some perhaps of interest to crystallographers, will be built. In order to find the widest possible
audience, these special purpose processors should be connected to a
network.
5. Conclusions
To the extent that certain single crystal structure solving techniques
become more automated, non-crystallographers will begin to use these
techniques in pursuit of their own goals. There are two conditions
external to crystallography which make it easier for this to come
about. Developments in the large scale integration of computer
componentry will make it possible to accomplish most of the
crystallographic data processing functions in an inexpensive terminal
device. The development of computer networks will bring chemical and
crystallographic information as well as bulk computational capability
to the scientist solving structures.
6. References
(1) E. F. Meyer, Morimoto, C. N., Villarreal, J., Berman, H. M.,
Carrell, H. L., Stodolo, R. K., Koetzle, T. F., Andrews, L. C.,
Bernstein, F. D., Bernstein, Fed. Proc. 33:2402-2405.(1975).
(2) Protein Data Bank, Acta Cryst. B29:1746. (1973).
(3) R. J. Feldmann, CHEM - Crystal Search and Retrieval. Bethesda,
Md.: Division of Computer Research and Technology, National
Institutes of Health, 325 pp. (1974).
(4) F. H. Allen, O. Kennard, W. D. S. Motherwell, W. G. Town, D. G.
Watson, Cambridge crystallographic data centre. II. Structural data
file. J. Chem. Doc. 13:119-123. (1974).
(5) S. R. Heller, Anal. Chem. 44:1951-1961.(1972).
(6) R. G. Ridley, G. R. Waller, Biochemical Applications of Mass
Spectrometry. New York: John Wiley. 835-836 pp. (1972).
(7) G. W. A. Milne, and S. R. Heller, 1976 J. Chem. Info. and Comp.
Sci. (in press).
(8) D. P. Leiter, H. L. Morgan, R. E. Stobaugh, J. Chem. Doc.
5:238-242.
(9) National Research Council. A study of national center for
computation in chemistry. Washington: National Academy of Sciences.
79 pp. (1974).
(10) Wilson, K. R. Computer Networking and Chemistry, ed. P. Lykos,
Amer. Chem. Soc., Wash. D. C. (1975).
Discussion
Q: (S. C. Abrahams) I thought I detected in the first part of your
talk a sense that crystallography would become a routine tool. May I
draw your attention to a recent report on the 'Status of
Crystallography' put out by the U.S.A. National Committee for
Crystallography. I think that you will find that there are many
areas which are anything but routine.
A: Yes. Perhaps I was stressing that aspect of crystallography in
which the use of the information is emphasized over the specifics of
the science itself.
Q: (J. Ladell) Intelligent terminals today cost about $6000 and will
probably not go down below $4000, but the real cost is software. I
believe crystallographers' software efforts should go into using
small computers, after techniques have been developed on large
systems, perhaps, to put their own expertise to work, as Dr. Parrish
has done. To be involved with a larger system may stifle innovation.
A: Costs may drop further. But for your major point, crystallography
will certainly continue to develop. My concern is the way the
information can be used.
Q: (C. K. Johnson) Crystallography is evolving and changing ways of
using computers is part of that. Neither has leveled off as yet. My
experience with networks, however, is that so far they only offer
'promise'.
A: Again, I believe that this is the way we can most effectively
address the question of how to use the information, and several
networks are operating now.
Q: (J. Karle) May I return briefly to the point that Dr. Abrahams
made. The developments you have described are exciting, for all
scientists, but need to be understood as different from the 'future
of crystallography'. Crystallography is not concerned with just
molecular arrangements which can be solved easily given nice single
crystals. The future of crystallography is concerned more with
solving problems in all of those areas with complex materials which
often are not even 'crystalline' in the above sense.
A: Thank you. Certainly my remarks are aimed at a tool for future
use by crystallographers and others.