CRYSTALLOGRAPHIC DATA RETRIEVAL AND DISPLAY: PROJECTION OF INFORMATION SYSTEM DEVELOPMENT



Richard J. Feldmann
Division of Computer Research and Technology
National Institutes of Health, Bethesda, Maryland

George W. A. Milne
National Heart and Lung Institute
National Institutes of Health, Bethesda, Maryland


and

Stephen R. Heller
Management Information Data Systems Division
Environmental Protection Agency, Bethesda, Maryland



Introduction

The emergence of crystallographic data files (the Cambridge Crystal file of small organic structures and the Brookhaven National Laboratory file of protein structures) has led to the development of systems of programs for search, retrieval and display of various aspects of these data.

This paper explores the implications of this trend in the uses of crystallographic information in an attempt to show what types of information systems will be needed and how these information systems interact with the basic process of structure determination.

The main characteristics of the future crystallographic information system will most probably be universal accessibility and integration with other chemical and biochemical data bases. The short term problems of program development and implementation on a computer network are discussed in relation to the long term problems of integrating crystallographic data bases with other data bases by the use of the Chemical Abstracts Registry identifiers.

1. Changes in Use of Structure Solution Techniques

Clearly, crystallographers will continue to investigate improved methods of structure solution and refinement. The existence, however, of a data base of all previously solved structures makes it possible to compare present and past structure solutions. The existence of fully developed structure solving computer program packages means that scientists in other disciplines can begin to apply crystallographic structure solving techniques to their problems. Care must be taken to make the limitations of the techniques known to these users. This can be accomplished by collaboration with crystallographers. Concomitant with the shift in primary discipline of these new users of single crystal structure solving techniques, there is a shift in the type of structures being solved. The shift is from one-time solution of difficult structures to solution of relatively simple structures as part of molecular assays. Crystallographers have used structures of escalating difficulty as vehicles for improving structure solving techniques. Now that the techniques can handle structures with a wide range of molecular weights with little difficulty, the non-crystallographer may consider applying the techniques to solving structures within this class. Since structures solved as part of a molecular assay contribute little or nothing toward the improvement of structure solving techniques, we can expect a shift from the publication of structure solutions to the determination of the similarity between a newly solved structure and previously solved structures, or to the entering of structure solutions into the universal structure data base.

The size (3.5 million structures) and class composition of the Chemical Abstracts Service (CAS) Structure Registry file indicate that there are many simple structures within reach of standard structure solving techniques.

As the general chemist, biochemist or physicist begins to look at collections of structures, a strong pressure develops to shift from manual structure solving to automated solving of simple structures.

Automation in this case is not the exclusion of human insight and decision-making capability from the structure solving process. Rather, it is the augmentation of human intellect by relegating, once and for all, to the computer those portions of the structure solving process which are purely routine. Automated structure solution implies a shift from isolated structure solving (computer controlled diffractometer, structure solving package, journals and models) to integrated structure solving (computer controlled diffractometer, structure solving package integrated with computer bibliographic and structure search, and computer modeling of solutions).

Automation is accomplished by integrating the procedures of each step in the structure solving process to avoid computer to non-computer and non-computer to computer interfaces, such as currently exist in the map calculation to map generation step and in the map coordinates to structure manipulation (drawing) step.

Taken in themselves the changes going on in this type of crystallography actually constitute only a change in extent. At the same time, however, we are witnessing a very rapid evolution in computer hardware, as explained in the next section.

2. Hardware Developments

The Large Scale Integration (LSI) of computer components is significantly affecting the cost of processors. This means that the cost of a given computational capability can be decreased, or that for constant cost the computational capability can be increased. Features which were formerly only possible in large computing systems are currently being incorporated into terminal devices. In the past a typical terminal device was a keyboard, a character printer and a telephone coupler. Such devices had no computational capability and essentially no graphic capability. The terminal device of the near future will have the following characteristics:

1. 128K to 256K bytes of memory

2. Microprogrammed central processing unit (CPU)

3. Interpreter for a high level language (BASIC, ALGOL or APL)

4. Line drawing graphics

5. Tape cassette or disk units

6. Communications interface

7. Instrumentation interface.

This level of capability can presently be obtained for under $10,000 and should drop to under $5,000 within three years if past hardware prices are any indication of what will happen in the future. These terminal devices have all of the characteristics of a complete computing environment. Individual minicomputers in the past have had some of these characteristics. The microprogrammed interpreter for a high level language is beginning to appear in the new terminal devices. This means that the high level language is the "machine-language".

Typical uses of such a terminal device include

1. Bibliographic and structure search

2. Control of or interaction with a diffractometer

3. Crystallographic data processing

4. Structure manipulation and display.

3. Information System Developments

As the LSI terminal devices proliferate both in number and in type in the next few years there will be a tendency to program each type of device as a special case. History is replete with parochial user groups which have sprung up around particular types of equipment. There is a need for a centralized source of crystallographic programs and structure data.

A centralized source of program and structure data has the following characteristics:

1. It reduces the chance of duplication of effort

2. It provides a new and faster facility for propagating information

Journals in some ways act as a central source of crystallographic programs and structure data. Any data found in a journal, however, must be re-entered into computer form. The most effective central source for computer related information is access to one or more computers via a network.

A network provides

1. Communication over long distances

2. Small computations on data bases (search and display)

3. Access to bulk computation

A computer network is in effect a utility for data transmission and computation. Networks come in three types of configurations

1. Point to point

2. Star

3. Symmetric

A point to point network can be developed by using the dial-up telephone system or by leasing permanent lines. Long distance rates are structured so that all distances greater than 600 miles effectively have the same rate. Point to point data transmission is very flexible. It allows scientists to make experiments in the use of computer based systems without the need for extensive logistics planning.

The star network brings all users together in each nodal city and then carries all the user messages to the central computer system over high speed telephone lines. There are several commercial star networks in existence. The General Electric network for example has nodes in all major North American cities, in the major cities of Western Europe, and in several cities in Japan, Brazil and Australia.

The symmetric network couples several independent computing systems, transfers data between computers and brings the users to any of the computing systems. There are several symmetric networks in operation. TYMNET is a commercial symmetric network. Users enter the network at nodal cities. Various institutions have attached their computers to the network. The National Library of Medicine operates MEDLINE over TYMNET.

Steps which have already been taken in the process of experimenting with the network use of crystallographic programs are:

1. Central program sources.

2. Point to point use of structure search and display systems

3. Implementation of data bases on a star network

4. Remote use of bulk computation.

The CRYSTNET project (1) has been concerned with the development of a display system for point to point use of crystallographic programs. The design of the display system is centered around a DEC PDP 11/45 computer and a Vector General display. These display systems interact over dial-up telephone lines with programs running on the Brookhaven CDC-6600. The rate structure of the large computers at the National Laboratories makes them in effect sources of bulk computation. Another aspect of the CRYSTNET project has been concerned with the standardization of crystallographic data processing programs. One of the uses of the standardized programs and the display system at each of the three CRYSTNET installations has been to prepare the protein structure file which is maintained at Brookhaven (2).

Experimentation has been done at the National Institutes of Health (NIH) on the use of the dial-up telephone system for accessing the programs which search for and display molecular structures (3) from the Cambridge Crystal Data base (4). The response by crystallographers, biochemists and chemists to the contents and style of use of the data base has been very good. More than twenty copies of the search and display programs and the data base have been exported to various institutions. Similar experimentation was done earlier at the NIH (5) with the Aldermaston Mass Spectrum data base (6). The strong user response led to the implementation in collaboration with the Environmental Protection Agency (EPA) of that data base on the General Electric computer network; and more recently on the ADP-Cyphernetics computer network. In the two years of operation the mass spectrum retrieval system has attracted 150 world-wide users (7).

4. Problems in System Development

Early experience with the development of display systems and the implementation and use of crystallographic programs and data bases on computer networks has led to the recognition of a number of problem areas. These problem areas are

1. Evolution of operating systems and strategies for the future generations of terminal devices

2. Support of program collection, standardization and maintenance

3. Support of data bases on a network

4. Integration of crystallographic data bases with other chemical and biochemical data bases

5. Solidification of the role of bulk computation systems

When the terminal device has only a character printer the user can connect to the network, do a search, disconnect from the network and look carefully at the results. The hard copy from the printer provides the medium for information review. If the terminal device has a graphics display and even a graphics hard copy unit, the user can review graphical details at leisure. The incorporation of a CPU, a storage device (a cassette or disk) and memory into terminal devices opens up new possibilities for network use. The user can follow the same connect-search-disconnect strategy but now the search is recorded on the cassette or disk. The CPU in the terminal device can then be used to re-search the retrieved results. For example, a user could connect to the network, formulate a search which would retrieve a suitably large or small sub-class of compounds, retrieve the sub-class and then disconnect from the network. If the terminal device has a program which can search the retrieved structures, the user can retrieve more specific structures. The structures thus retrieved from the local cassette or disk can then be displayed and manipulated without further network costs. The interactive nature of the program running in the terminal device should be the same as the style of the program running on the network. This program should be obtained from the central program source by connecting to the network and issuing a command which transfers the program from the network repository computer to the terminal device.
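The connect-search-disconnect strategy with local re-searching can be sketched in a few lines. This is only an illustration under stated assumptions, not the NIH or CRYSTNET software: the record fields, the function names `broad_search` and `refine_locally`, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    """One retrieved entry; the fields are illustrative, not a real file format."""
    ident: str
    formula: str
    n_atoms: int


# Stand-in for the data base held on the network computer (hypothetical entries).
NETWORK_FILE = [
    Structure("A1", "C6H6", 12),
    Structure("A2", "C6H12O6", 24),
    Structure("A3", "C2H6O", 9),
    Structure("A4", "C10H8", 18),
]


def broad_search(max_atoms):
    """Runs while connected: retrieve a sub-class to store on cassette or disk."""
    return [s for s in NETWORK_FILE if s.n_atoms <= max_atoms]


def refine_locally(cache, element):
    """Runs after disconnecting: re-search only the locally stored sub-class.

    The substring test is deliberately naive (it cannot tell C from Cl);
    a real search would parse the formula or use the connectivity.
    """
    return [s for s in cache if element in s.formula]


local_cache = broad_search(max_atoms=20)        # one network session
oxygen_hits = refine_locally(local_cache, "O")  # no further network cost
print([s.ident for s in oxygen_hits])
```

The point of the split is that only `broad_search` incurs connect time; every later refinement works on the copy held in the terminal device.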

The integrated terminal device can also be used for the initiation of batch jobs on a bulk computer attached to the network. Again the existence of a CPU, storage and memory in the terminal device opens up new possibilities. The terminal device can be used to accept bulk computation results from the network. The user can then review the bulk computation results at leisure and reprocess selected information in graphical terms. For example, the terminal device may be controlling the diffractometer, or even communicating with an even smaller CPU which in turn is controlling the diffractometer. Data thus collected is shipped up the network to the bulk computer. Data processing occurs and the results are shipped down the network to the terminal device. The resulting maps (Patterson or electron density) can be drawn on the terminal device. If programs are available from the network program source, the map sections can be drawn in stereo and the molecule fitted to the electron density. This achieves a very nice balance between the tasks which require a high degree of interaction (diffractometer data collecting and molecular graphics) and the tasks which require either a powerful CPU or extensive amounts of memory or storage.

Because the CPU's of terminal devices are microprogrammed to interpret one or more high level languages, the central repository will only have to hold programs in the form of high level source statements. This will considerably simplify the problem of making programs run on a number of different types of terminal devices. In the past when machine language was the lowest level for programming, there was always the possibility of descending from the high level language to machine language to achieve some particular effect. Microprogramming of a high level interpreter imposes other constraints. It seems fairly clear that FORTRAN will not and perhaps should not be interpreted by such microprogrammed CPU's. A more strongly structured language such as an ALGOL, a PL1 or APL is easier to microprogram. As a central program source is developed, consideration should be given to choice of program language since FORTRAN programs are not easy to maintain. The CRYSTNET experience supports this feeling.

The development of a central repository of crystallographic programs not associated with a data base is conditioned among other things by the problem of finding support for such an activity. National or international support of program maintenance is very difficult. Network surcharging offers one possible approach to supporting program maintenance. Each time a program source is copied from the network repository, a charge is levied against the user's network account. This money goes to the organization or individual designated as the maintainer of that particular program. The network provides the medium for collecting the increments of user generated support.
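The surcharging scheme amounts to simple bookkeeping, which can be sketched as follows. The program names, maintainers and fees are hypothetical, and the sketch assumes a flat fee per copy; the text does not specify how a real network would set or collect the charge.

```python
from collections import defaultdict

# Hypothetical repository: program name -> (maintainer, surcharge per copy in dollars).
REPOSITORY = {
    "fourier_map": ("Lab A", 2.00),
    "least_squares": ("Lab B", 3.50),
}

user_charges = defaultdict(float)       # what each user's network account owes
maintainer_income = defaultdict(float)  # what each maintainer has been credited


def copy_program(user, program):
    """Record one copy of a program source and levy the surcharge."""
    maintainer, fee = REPOSITORY[program]
    user_charges[user] += fee
    maintainer_income[maintainer] += fee
    return fee


copy_program("user1", "fourier_map")
copy_program("user2", "fourier_map")
copy_program("user1", "least_squares")
print(dict(maintainer_income))
```

Because every copy passes through the repository computer, the network itself can keep both ledgers; no separate billing arrangement between user and maintainer is needed.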

Because of the difference in size, the problems affecting data base maintenance are far greater than those affecting program source maintenance. Data bases are typically one to four orders of magnitude larger than program sources. Whereas the disk space used in storing program sources on a machine on a network is and will probably continue to be negligible, the disk space used by any important data base is very large. The Cambridge crystal file, as an example, uses 1.3 x 10^7 bytes of storage and costs $500 per month to store on the ADP-Cyphernetics network. The NIH/EPA experience with the Aldermaston Mass Spectrum data base indicates that the user subscription is a feasible mechanism for supporting a data base on a commercial network. When a commercial star network is used as the medium of access, the data base manager must either recover storage charges from users or seek national or international subsidy. The commercial symmetric network, however, permits an individual institution to support the data base on its machine and allows that institution the right to permit access by other network customers. The symmetric network permits a broad range of institutions to enter into the business of supporting scientific data bases as adjuncts of scientific enterprises.
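The storage figures quoted for the Cambridge crystal file work out to roughly $38 per million bytes per month. The short calculation below shows this, together with the charge each subscriber would have to carry if storage were recovered by an even split. The subscriber counts are hypothetical, and an even split is an assumption made for illustration rather than the pricing actually used.

```python
FILE_BYTES = 1.3e7        # size of the Cambridge crystal file, from the text
MONTHLY_COST = 500.0      # storage charge in dollars per month, from the text

# Storage charge per million bytes per month.
cost_per_megabyte = MONTHLY_COST / (FILE_BYTES / 1e6)


def monthly_share(n_subscribers):
    """Storage charge each subscriber must cover if costs are split evenly."""
    if n_subscribers <= 0:
        raise ValueError("need at least one subscriber")
    return MONTHLY_COST / n_subscribers


print(round(cost_per_megabyte, 2))
for n in (10, 50, 100):   # hypothetical subscriber counts
    print(n, round(monthly_share(n), 2))
```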

As more data bases come into existence or are made accessible through computer networks, it becomes important to integrate the information contained in the individual data bases. The Chemical Abstracts Service (CAS) has been registering compounds for the last ten years (8). CAS now has a data base of approximately 3.5 million compounds. The CAS registry identifier acts as the connection between different data bases. The NIH and the EPA are in the process of obtaining CAS registry identifiers for the mass spectral, nuclear magnetic resonance and crystallographic data bases. These data bases, with the CAS-provided chemical connectivity and chemical nomenclature, should form the basis for the accretion of other chemical data bases. These integrated data bases will be available on a commercial star network (ADP-Cyphernetics) in the near future. The CAS registry identifier can serve as a link to searching on TYMNET for references in the NLM TOXLINE and in the Lockheed data base implementation of Chemical Abstracts Condensates.
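The role of the registry identifier as the link between data bases can be illustrated with a small sketch. The identifiers and record contents below are placeholders, not real registry numbers or data, and the dictionary layout is an assumption made for illustration rather than the structure of the NIH/EPA files.

```python
# Each data base is keyed by registry identifier (placeholder values only).
mass_spectra = {"REG-0001": "spectrum A", "REG-0002": "spectrum B"}
nmr_spectra = {"REG-0002": "nmr B", "REG-0003": "nmr C"}
crystal_structures = {"REG-0001": "structure A", "REG-0002": "structure B"}

DATA_BASES = {
    "mass_spectrum": mass_spectra,
    "nmr": nmr_spectra,
    "crystal_structure": crystal_structures,
}


def lookup(registry_id):
    """Collect every record filed under one registry identifier."""
    return {
        name: table[registry_id]
        for name, table in DATA_BASES.items()
        if registry_id in table
    }


def in_all_data_bases():
    """Identifiers for which every data base holds a record."""
    key_sets = [set(table) for table in DATA_BASES.values()]
    return sorted(set.intersection(*key_sets))


print(lookup("REG-0001"))
print(in_all_data_bases())
```

Once each file carries the same identifier, adding a further data base only requires registering it in `DATA_BASES`; no pairwise translation between files is needed.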

The last of the problem areas to be discussed is that of solidifying the role of bulk computer systems. There are substantial differences in the philosophies prevailing in the United States on one hand and many European countries on the other. The ready availability of extremely inexpensive bulk time on subsidized machines in European laboratories presents star networks with a more serious barrier to acceptance than they experience in the United States. The safest strategy assumes that there will always be sources of bulk computation independent of terminal power. The National Resource for Computation in Chemistry (NRCC) (9) is the current example of a bulk source trying to come into existence. If the NRCC exists on a network, crystallographers will have controlled access to large blocks of computation. It is not clear whether NRCC should be concerned with program maintenance because the funding for program maintenance would get confused with the funding for bulk computation. Bulk computation can be considered as one example of special purpose computation. In the next few years special purpose processors will be readily constructed from LSI components. Wilson (10) is currently building a processor which will simulate the Newtonian forces between atoms in a large molecule. Barry at Washington University in St. Louis is building a processor which rapidly searches for conformations of a molecule as specified by NMR data. Inevitably other processors, some perhaps of interest to crystallographers, will be built. In order to find the widest possible audience, these special purpose processors should be connected to a network.

5. Conclusions

To the extent that certain single crystal structure solving techniques become more automated, non-crystallographers will begin to use these techniques in pursuit of their own goals. There are two conditions external to crystallography which make it easier for this to come about. Developments in the large scale integration of computer componentry will make it possible to accomplish most of the crystallographic data processing functions in an inexpensive terminal device. The development of computer networks will bring chemical and crystallographic information as well as bulk computational capability to the scientist solving structures.

6. References

(1) E. F. Meyer, Morimoto, C. N., Villarreal, J., Berman, H. M., Carrell, H. L., Stodolo, R. K., Koetzle, T. F., Andrews, L. C., Bernstein, F. D., Bernstein, Fed. Proc. 33:2402-2405. (1975).

(2) Protein Data Bank, Acta Cryst. B29:1746. (1973).

(3) R. J. Feldmann, CHEM - Crystal Search and Retrieval. Bethesda, Md.: Division of Computer Research and Technology, National Institutes of Health, 325 pp. (1974).

(4) F. H. Allen, O. Kennard, W. D. S. Motherwell, W. G. Town, D. G. Watson, Cambridge crystallographic data centre. II. Structural data file. J. Chem. Doc. 13:119-123. (1974).

(5) S. R. Heller, Anal. Chem. 44:1951-1961. (1972).

(6) R. G. Ridley, G. R. Waller, Biochemical Applications of Mass Spectrometry. New York: John Wiley. 835-836 pp. (1972).

(7) G. W. A. Milne and S. R. Heller, J. Chem. Info. and Comp. Sci. (in press). (1976).

(8) D. P. Leiter, H. L. Morgan, R. E. Stobrough, J. Chem. Doc. 5:238-242.

(9) National Research Council. A study of national center for computation in chemistry. Washington: National Academy of Sciences. 79 pp. (1974).

(10) Wilson, K. R. Computer Networking and Chemistry, ed. P. Lykos, Amer. Chem. Soc., Wash. D. C. (1975).



Discussion

Q: (S. C. Abrahams) I thought I detected in the first part of your talk a sense that crystallography would become a routine tool. May I draw your attention to a recent report on the 'Status of Crystallography' put out by the U.S.A. National Committee for Crystallography. I think that you will find that there are many areas which are anything but routine.

A: Yes. Perhaps I was stressing that aspect of crystallography in which the use of the information is emphasized over the specifics of the science itself.

Q: (J. Ladell) Intelligent terminals today cost about $6000 and will probably not go down below $4000, but the real cost is software. I believe crystallographers' software efforts should go into using small computers, after techniques have been developed on large systems, perhaps, to put their own expertise to work, as Dr. Parrish has done. To be involved with a larger system may stifle innovation.

A: Costs may drop further. But as for your major point, crystallography will certainly continue to develop. My concern is the way the information can be used.

Q: (C. K. Johnson) Crystallography is evolving, and changing ways of using computers is part of that. Neither has leveled off as yet. My experience with networks, however, is that so far they only offer 'promise'.

A: Again, I believe that this is the way we can most effectively address the question of how to use the information, and several networks are operating now.

Q: (J. Karle) May I return briefly to the point that Dr. Abrahams made. The developments you have described are exciting, for all scientists, but need to be understood as different from the 'future of crystallography'. Crystallography is not concerned with just molecular arrangements which can be solved easily given nice single crystals. The future of crystallography is concerned more with solving problems in all of those areas with complex materials which often are not even 'crystalline' in the above sense.

A: Thank you. Certainly my remarks are aimed at a tool for future use by crystallographers and others.