The NIH-EPA Chemical Information System

G. W. A. MILNE

National Institutes of Health, Bethesda, MD 20014

S. R. HELLER

Environmental Protection Agency, Washington, DC 20460

The quantity of data associated with analytical chemistry has been expanding very rapidly during the last twenty years or so, but until recently, the efficient application of computers to this problem has been vitiated by the high costs of computer storage and computation. With the continuing improvement in computer technology and the steady decrease in computation costs, it has, in the last two years, become feasible to consider the development of a completely interactive chemical information system.

Earlier searching systems avoided the cost of bulk storage by maintaining the data files on magnetic tape rather than disk. Tape is a very cheap form of storage, but its use implies batch searching, which is necessarily slow because tape is not susceptible to random access. Magnetic disks, on the other hand, are random access devices and the data stored on them can be searched very rapidly. Until recently, the cost of storage on disks has precluded their use for large data bases, but now it is becoming practical to consider this approach. Since this permits interactive computing, we have developed a chemical information system that uses disk exclusively for the storage of data.

Interactive computing is a significantly different process from batch, or off-line, computing and a different philosophy can be used in the design of programs for such work. A major problem that a chemist has in searching a chemical data base is that the best questions to ask are often not known. An interactive system can provide the answer to a question immediately and this will enable the user to see the deficiencies in the question and to frame a new query. In this way, there can be built a feedback loop in which the chemist acts as a transducer, "tuning" the query until the system reports precisely what is required.

The NIH-EPA Chemical Information System (CIS), which is described in this chapter, has been designed around this general approach (1).

System Design

The CIS consists of a collection of chemical data bases together with a battery of programs for interactive searching through these disk-stored data bases. In addition, there is a series of programs for the analysis of data, either to reduce them to a form suitable for searching purposes or as an end in itself.

The data bases that are in the CIS include files of mass spectra, carbon-13 nuclear magnetic resonance spectra, x-ray diffraction data for crystals and powders, and several bibliographic data bases. The analytical programs include a family of statistical analysis and mathematical modelling algorithms and programs for the calculation of isotopic enrichment from mass spectral data, analysis of nmr spectra and energy minimization of conformational structures.

a. Addition of components to the CIS. A general protocol for updating of CIS components or the addition to the CIS of new components has been established and a schematic diagram of this is shown in Figure 1.

In the first phase, a data base is acquired, if necessary, from one of a variety of sources. Some of the CIS data bases have been developed specifically for the CIS, an example of this being the mass spectral data base (2). Other data bases, such as the Cambridge Crystal File (3), are leased for use in the CIS and still others, such as the X-ray powder diffraction file (4), are operated within the CIS by their owners. Next, the necessary program development is undertaken. If the component is one involving searching of a data base, some reformatting of the data base, sorting and inversion of files and so on, is usually required, and this is carried out on the NIH IBM 360-168, which is well-suited to processing of large files of data. Once searchable files have been prepared, they are transferred to the NIH PDP-10 computer, which is primarily a time-sharing computer, and the programs for searching through the data bases are written. The analytical, data base-independent programs of the CIS are usually written entirely on the PDP-10. Out of this work there finally emerges a pilot version of the component.

The pilot version is then allowed to run on the NIH PDP-10 and access to it is provided to a small number of people who can log into the NIH computer by telephone, using long distance calls if necessary. These users are provided with free computation and in return, they test the component thoroughly for errors and deficiencies. Such problems are reported to the development team, which attempts to deal with them. Depending upon the size and complexity of the component, this testing phase can last as long as eighteen months.

When testing is complete, the entire component is exported to a networked PDP-10 in the private sector and the version on the NIH computer is dispensed with. The component in the private sector is available to the general scientific community and can be used on a fee-for-service basis. In this phase, the government retains no financial interest in the component; it is "managed" by a sponsor outside the U.S. Government. The Department of Industry of the British Government, for example, maintains the Mass Spectral Search System on the network. Advice and consultation between such sponsors and NIH/EPA personnel continue, but the U.S. Government does not subsidize the routine operation of CIS components in the private sector. In fact, various agencies of the government are actually users of the CIS and they pay according to their use of the system, like any other user. Charges for use of these components must be designed to cover costs, and if the component attracts insufficient use at these prices, then it is probably not viable and its sponsor need not continue to support it.

b. Computer facilities used by the CIS. Programs of the CIS have usually been designed for use with a DEC PDP-10 computer system. The reason for this is that the PDP-10 is one of the better time-sharing systems available and has been adopted by a number of commercial computer network companies as the main vehicle for their networks. Transfer of a program from the NIH PDP-10 to a network PDP-10 is usually straightforward, and use of a networked computer is favored because the alternative philosophy of exporting programs and data bases to locally operated PDP-10 computers is less workable. This latter approach contains a number of deficiencies that are overcome by a network. Most important in this connection is the fact that use of a network machine means that data bases need only be stored once, at the center of the network. A great deal of money is thus saved because duplicate storage is not necessary. Further, a single copy of a data base is easy to maintain, whereas updating of a data base that resides on many computers is virtually impossible. Finally, communication between systems personnel and users is very simple in a network environment, as is monitoring of system performance.

For these and other reasons, the policy of disseminating the CIS via networked PDP-10 computers was adopted at the outset and has proved to be quite successful. A typical U.S. network of this sort has something under 100 nodes - i.e., local telephone call access is available in about 100 locations. These are mainly in the U.S., but a substantial number will be found in Europe. Further, some computer networks are now themselves interfaced to the Telex network, thus making their computer system available worldwide. Irrespective of one's location, the cost of access is somewhere between $7 and $15 per hour, depending upon the transmission speed used and also on the time of day. Networks usually offer 110, 300 and 1200 baud service and the response time of the system is usually negligibly small.

The only equipment that is required to establish access to a computer network is a telephone-coupled computer terminal. Typewriter terminals are becoming very common and are also becoming relatively cheap. Such a terminal can be purchased from a variety of manufacturers for between $1,000 and $3,000 and, in general, will operate at 300 baud (30 characters/second). A cathode ray terminal capable of running at 1200 baud can be purchased for as little as $2,000. Any equipment of this sort can usually be leased or purchased.

Components of the CIS

a. Mass Spectral Search System. The Mass Spectral Search System (MSSS) is the oldest component of the CIS, having been developed in 1971, and has been seen as a prototype for more recent components. Developed as a joint effort between NIH, EPA, and the Mass Spectrometry Data Centre (MSDC) in England, the current MSSS data base contains about 30,000 mass spectra representing the same number of compounds. This has been derived from an archival file containing some 60,000 spectra of the same 30,000 compounds (5). Computer techniques have been employed to assign every spectrum a quality index (6) and where duplicate spectra appear in the archive file, only the best spectrum is used in the working file. All compounds in the archive have been assigned a Chemical Abstracts Service (CAS) registry number, a unique identifier that is used to locate duplicate entries for the compounds, find the compound in other CIS files and provide structure and synonym lookup capabilities throughout the CIS.

Searches through the MSSS data base can be carried out in a number of ways. With the mass spectrum of an unknown in hand, the search can be conducted interactively, as is shown in Figure 2. In this search the user finds that 24 data base spectra have a base peak (minimum intensity 100%, maximum intensity 100%) at an m/e value of 344. When this subset is examined for spectra containing a peak at m/e 326 with intensity of less than 10%, only 2 spectra are found. If necessary, the search can be continued in this way until a manageable number of spectra are retrieved as fulfilling all criteria that the user cited. These answers can then be listed as is shown in Figure 2. Alternatively, the file can be examined for all occurrences of a specific molecular weight or a partial or complete molecular formula. Combinations of these properties can also be used in searches. Thus, all compounds containing, for example, five chlorines and whose mass spectra have a base peak at a particular m/e value can be identified.
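The narrowing search described above can be sketched as a chain of filters, each applied to the subset left by the previous one. The following Python fragment is only an illustration under stated assumptions, not the MSSS code: the `Spectrum` record, the `peak_search` function and the four sample spectra are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Spectrum:
    """Hypothetical record: a name and a map of m/e value -> relative intensity (%)."""
    name: str
    peaks: dict


def peak_search(spectra, mz, min_intensity, max_intensity):
    """Keep the spectra that have a peak at `mz` with intensity inside the given range."""
    return [
        s for s in spectra
        if mz in s.peaks and min_intensity <= s.peaks[mz] <= max_intensity
    ]


# Invented data, for illustration only.
library = [
    Spectrum("compound A", {344: 100, 326: 5, 91: 40}),
    Spectrum("compound B", {344: 100, 326: 35}),
    Spectrum("compound C", {344: 60, 326: 8}),
    Spectrum("compound D", {344: 100, 91: 12}),
]

step1 = peak_search(library, 344, 100, 100)  # base peak at m/e 344
step2 = peak_search(step1, 326, 0, 10)       # then a weak peak at m/e 326
print(len(step1), [s.name for s in step2])
```

Because each step operates on the previous answer set, the user can stop as soon as the list is short enough to inspect by eye.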

In contrast to these interactive searches, which are of little appeal to those with large numbers of searches to carry out, there is available a batch-type search which accepts the complete spectrum of the unknown and sequentially examines all spectra in the file to find the best fits. A user's data system can be connected to the network for this purpose and the unknown spectra can be down-loaded into the network computer for use in this search, which can be carried out at once or, preferably, overnight at 30% of the cost.

Once an identification has been made, the name and registry number of the data base compound are reported to the user. If necessary, the data base spectrum can be listed or, if a CRT terminal is being used, plotted, to facilitate direct comparison of the unknown and standard spectra.

The MSSS has been generally available through computer networks for several years and is currently resident upon the ADP-Cyphernetics network, where some 3,000 searches and 2,000 other transactions, such as retrievals, are carried out each month by the approximately 200 users. All searches in the MSSS are transaction priced at between $1 and $7 and in addition to these charges and the connect time charge, users must pay the annual subscription fee of $300. This fee is used to defray the annual disk storage charges which are paid in advance by the sponsor of the MSSS, the Department of Industry of the British Government.

b. Carbon Nuclear Magnetic Resonance (CNMR) Spectral Search System. The data base that is used in the CNMR search system consists currently of 4,400 CNMR spectra. As in the case of the MSSS, every compound has a CAS registry number, and all exact duplicates have been removed from the file. A specific compound may still appear in this file more than once, however, because its CNMR spectrum may have been recorded in different solvents. The CNMR file is still small but is growing at a fairly steady rate and should benefit considerably from recent international agreements to the effect that all major compilations of CNMR data will, in the future, be pooled.

Searching through this data base, as in the case of the MSSS, can be interactive or not. In the interactive search, a user enters a shift, with an acceptable deviation, and the single frequency off-resonance decoupled multiplicity, if that is known. The program reports the number of file spectra fitting one or both of these criteria. The names of the compounds whose spectra have been retrieved can be listed, or alternatively, the list can be reduced by the entry of a second chemical shift. A search for spectra of compounds having a specific complete or partial molecular formula can also be carried out, but there is no capability for searching on molecular weight, a parameter of little relevance to CNMR spectroscopy.
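The interactive shift search just described can be sketched as follows. This is a minimal illustration, assuming each file entry is stored as a list of (shift in ppm, multiplicity) pairs keyed by accession number; the function name, the data layout and the three sample entries are hypothetical and are not taken from the CIS programs.

```python
def shift_search(spectra, shift, deviation, multiplicity=None):
    """Return accession numbers of spectra having a line within `deviation` ppm
    of `shift`. If `multiplicity` is given, the line must also carry it."""
    hits = []
    for accession, lines in spectra.items():
        for line_shift, line_mult in lines:
            close_enough = abs(line_shift - shift) <= deviation
            if close_enough and multiplicity in (None, line_mult):
                hits.append(accession)
                break  # one matching line is enough for this entry
    return hits


# Invented entries, for illustration only.
spectra = {
    1: [(128.5, "d"), (21.3, "q")],
    2: [(128.9, "s"), (170.1, "s")],
    3: [(77.0, "d")],
}

with_multiplicity = shift_search(spectra, 128.7, 0.5, "d")
shift_only = shift_search(spectra, 128.7, 0.5)
# Entering a second shift reduces the list to entries satisfying both.
reduced = set(shift_only) & set(shift_search(spectra, 170.0, 0.5))
print(with_multiplicity, shift_only, reduced)
```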

Figure 1. Protocol for adding a component to the CIS

Figure 2. PEAK search in the MSSS

If an interactive search is not appropriate to the problem at hand, a batch type of search through the data base using the techniques described by Clerc et al. (7) is available. To carry out such a search, the user enters all the chemical shifts from the unknown and starts the search. The entire unknown spectrum is compared to every entry in the file and the best fits are noted and reported to the user. This program searches for the absence of peaks in a given region as well as for the presence of peaks and thus has the capability of finding those compounds which are structurally similar to the material that gave the unknown spectrum.

When a search is completed, the user is provided with the accession numbers of spectra that fit the input data. The names and CAS registry numbers of the compounds in question will also be given. If more information is required, the complete entry for a given accession number can be retrieved. This includes a numbered structural formula, the name, molecular formula and registry number of the compound, experimental data pertaining to the spectrum and the entire spectrum, together with single frequency off-resonance decoupled multiplicities and (if available) relative line intensities and assignments.

This CNMR search system recently has been made available on the ADP-Cyphernetics network. Searches are all transaction priced at $1-3.

c. X-ray Crystallographic Search System. This is a series of search programs working against the Cambridge Crystal File (8), a data base of some 15,000 entries dealing with published crystallographic data mainly for organic compounds. The entry for each compound contains the compound name, its molecular weight and registry number, the space group in which it crystallizes and the parameters of the unit cell of the crystal. The file may be searched on the basis of any of these parameters as shown in Figure 3, which is an example of a search for any compounds that crystallize in space group P 1 and have a molecular weight between 250 and 300. As can be seen from Figure 3, there are 98 entries with the correct space group (scratch file 1) and 867 with a molecular weight within the specified range (scratch file 3). The intersection of these files reveals that only 3 compounds (scratch file 4) meet both specifications, and the first of these compounds, crystal sequence number 4413, is listed.
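The scratch-file mechanism in this example amounts to saving each search result as a set of sequence numbers and intersecting the sets. The sketch below illustrates that idea only; the `search` helper, the field names and the four entries are hypothetical, and the real system's file handling is not described in enough detail here to reproduce.

```python
def search(entries, predicate):
    """Return a 'scratch file': the set of crystal sequence numbers whose
    entry satisfies `predicate`."""
    return {seq for seq, entry in entries.items() if predicate(entry)}


# Invented entries, for illustration only.
entries = {
    101: {"space_group": "P1", "mol_weight": 262.3},
    102: {"space_group": "P1", "mol_weight": 410.5},
    103: {"space_group": "P21/c", "mol_weight": 288.0},
    104: {"space_group": "P1", "mol_weight": 299.9},
}

by_group = search(entries, lambda e: e["space_group"] == "P1")
by_weight = search(entries, lambda e: 250 <= e["mol_weight"] <= 300)
both = by_group & by_weight  # intersection of the two scratch files
print(sorted(both))
```

Keeping every intermediate result lets the user combine old answers in new ways without repeating a search.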

All the compounds in this file have been registered by the CAS and their connection tables have been merged into the file. This data base is, therefore, searchable on a structural or substructural basis, as are all the other files of the CIS.

Once an entry of interest in this file has been located by one of the search programs, its file accession number, the "crystal sequence number," can be used to retrieve the appropriate literature reference or the structure, or both.

This file is available for general use via the ADP-Cyphernetics network. Currently, the options in this system are not transaction-priced. Enough statistics are now available to indicate that all searches, other than structural searches, cost less than $2.00 and that the structural searches cost possibly as much as $10.00.

d. X-ray Crystal Data Retrieval System. The National Bureau of Standards (NBS) has collected a file of data pertaining to some 24,000 crystals, including those in the Cambridge file described in (c) above (9). The data in the NBS file include the cell parameters, the number of molecules, Z, in the unit cell, the measured and calculated densities of the crystal and two determinative ratios, such as A/B and A/C. Every compound in the file is identified by its name, molecular formula, and registry number, and the file can be structurally searched by the CIS substructure search system which is described below.

Searches through this data base for crystals with specific space groups or densities are possible, and crystals with unit cells of given dimensions can also be found. It is hoped that this may prove to be a very rapid method of identifying compounds from the readily measured crystal properties.

e. X-ray Powder Diffraction Retrieval Program. Compounds that fail to crystallize may still be examined by X-ray diffraction, because powders give characteristic diffraction patterns. A collection of powder diffraction patterns proves to be a very effective means by which to identify materials and indeed, one of the very earliest search systems in chemical analysis was based upon such data by Hanawalt (10) nearly forty years ago. The data base of some 27,000 powder diffraction patterns (11) that is used in the CIS is in fact a direct descendant of that with which Hanawalt carried out his pioneering work. A problem that arises in connection with this particular component stems from the fact that powders, as opposed to crystals, are frequently impure. The patterns that are obtained experimentally, therefore, are often combinations of one or more file entries. A reverse searching program, that examines the experimental data to see if each entry from the file is contained in it (12), has been written and seems to cope with this particular difficulty. It is currently running in test on the NIH PDP-10 and will be made available to the scientific community during 1977.
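The idea of a reverse search can be sketched as follows: instead of asking whether the experimental pattern matches a file entry, one asks whether each file entry is contained in the experimental pattern, so that every component of a mixture can be reported. The fragment below is a simplified illustration, assuming patterns are lists of d-spacings compared within a fixed tolerance; the function names, the tolerance and the data are hypothetical, and the program of reference (12) may also weigh line intensities.

```python
def contained_in(entry_lines, observed_lines, tol=0.02):
    """True if every line of the file entry has a partner in the observed pattern."""
    return all(
        any(abs(d - obs) <= tol for obs in observed_lines)
        for d in entry_lines
    )


def reverse_search(file_entries, observed_lines, tol=0.02):
    """Names of all file entries whose lines are all present in the observed pattern."""
    return [
        name for name, lines in file_entries.items()
        if contained_in(lines, observed_lines, tol)
    ]


# Invented d-spacings (angstroms), for illustration only.
file_entries = {
    "phase X": [3.34, 4.26, 1.82],
    "phase Y": [2.82, 1.99, 1.63],
    "phase Z": [3.03, 2.28, 2.09],
}
observed = [4.25, 3.35, 2.81, 1.99, 1.82, 1.63]  # a mixture of two phases

found = reverse_search(file_entries, observed)
print(found)
```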

f. Substructure Search System (SSS). All the compounds in the files of the CIS have been assigned a registry number by the CAS. The registry number is a unique identifier for that compound, and may be used to retrieve from the CAS Master Registry all the synonyms that the CAS has identified for the compound, these being names that have been used for the compound, in addition to the name used in the CAS 9th Collective Index. Further, the registry number can be used to locate in the CAS files the connection table for the compound's structure. This is a two-dimensional record of all atoms in the molecule together with the atoms to which each is bonded and the nature of the bonds (13). This connection table is the basis of the substructure search component of the CIS (14).

The purpose of the SSS is to permit a search for a user-defined structure or substructure through data bases of the CIS. If a substructure is found to be in a CIS data base, then armed with its registry number, the user can access that file and locate the compound and hence inspect whatever data is available for it.

As the first step in this process, the user must, of course, be able to define the structure of interest to the computer. This is done with a family of structure generation programs which can, for example, create a ring of a given size, a chain of a given length, a fused ring system and so on. Branches, bonds and atoms can be added and the nature of bonds can be specified. The element represented by any node can be defined; in the absence of such definition, the atom is presumed to be carbon. As the query structure is developed using these commands, the computer stores the growing connection table. If the user wishes to view the current structure at any point, the display command (D) can be invoked. This command, using the current connection table, generates a structure diagram that can be printed at a conventional terminal.

When the appropriate query structure has been generated, a number of search options can be invoked to find occurrences of this query structure in the data base. The two most useful search options are the fragment probe and the ring probe. The fragment probe will search through the assembled connection tables of the data base for all occurrences of a particular fragment, i.e., a specific atom, together with all its neighbors and bonds. The user may specify particular fragments in the query structure which are thought to be fairly unique and characteristic of that structure. Alternatively, a search for every fragment in the query structure may be requested. The general form of a fragment probe is as shown in Figure 4. The query structure contains only one relatively unique node, C3, and this is the one which is sought in the data base. It is found to occur 980 times and a temporary file of just those particular entries is stored as file #2. This can be accessed by the user either for the purpose of listing its contents, as is shown in the figure, or to intersect it with other scratch files.
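A fragment probe can be sketched on a toy connection table. In the illustration below a connection table is assumed to be a dictionary mapping an atom number to its element and its bonded neighbours (hydrogens omitted), and a fragment is the central element together with the sorted list of neighbour elements and bond orders. This layout, the function names and the three sample molecules are hypothetical and are not the CAS connection table format.

```python
def fragment(table, atom):
    """Describe `atom` by its element plus the elements and bond orders of its neighbours."""
    element, bonds = table[atom]
    neighbours = sorted((table[n][0], order) for n, order in bonds.items())
    return (element, tuple(neighbours))


def fragment_probe(database, query_table, query_atom):
    """Return the keys of all structures containing the query atom's fragment."""
    target = fragment(query_table, query_atom)
    return {
        key for key, table in database.items()
        if any(fragment(table, atom) == target for atom in table)
    }


# atom number -> (element, {neighbour atom number: bond order})
ethanol = {1: ("C", {2: 1}), 2: ("C", {1: 1, 3: 1}), 3: ("O", {2: 1})}
acetic_acid = {
    1: ("C", {2: 1}),
    2: ("C", {1: 1, 3: 2, 4: 1}),
    3: ("O", {2: 2}),
    4: ("O", {2: 1}),
}
acetone = {
    1: ("C", {2: 1}),
    2: ("C", {1: 1, 3: 2, 4: 1}),
    3: ("O", {2: 2}),
    4: ("C", {2: 1}),
}
database = {"ethanol": ethanol, "acetic acid": acetic_acid, "acetone": acetone}

# Probe for the carboxyl carbon (atom 2 of the query structure).
hits = fragment_probe(database, acetic_acid, 2)
print(hits)
```

The answer set plays the role of a scratch file and can be intersected with the results of other probes.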

The ring probe search is a search through the data base for all structures containing the same ring or rings as the query structure. A ring that is considered to be an answer to such a query must be the same size as that in the query structure. It must also contain the same number of non-carbon atoms (heteroatoms), but the nature of the heteroatoms and the position of any substituents can be required by the user to be the same as or different from that in the query structure. Thus with a query structure of pyrrole, the only "exact" answer is pyrrole, but the user may permit the retrieval of "imbedded" answers which would include furan and thiophene. Similarly, o-xylene itself is the only "exact" match for o-xylene, but m- and p-xylene would be considered as "imbedded" matches. An example of a ring probe search is given in Figure 5. Here the query structure is 3,4-dichlorofuran, but imbedded matches for heteroatoms and substituents have been allowed and so the list of 304 answers will include any disubstituted pyrrole as well as any disubstituted furan and so on. Higher substitution will also be permitted.

In addition to these structural searches, there are a number of "special properties" searches that often prove to be very useful as a means of reducing a large list of answers resulting from structure searches. The special properties searches include searches for a specific molecular weight or range of molecular weights and a search for compounds containing a given number of rings. Searches may also be conducted for the molecular formula corresponding to the query structure, or for a different user-defined molecular formula. This may be specified completely or partially and the number of atoms of any element may be entered exactly or as a permissible range.
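A partial molecular formula search with exact counts or permissible ranges can be sketched in a few lines. The fragment below is an illustration only: the simple formula parser handles plain formulas such as C6HCl5 (no parentheses or charges), and the function names and the constraint format are assumptions made for the example.

```python
import re


def parse_formula(formula):
    """Turn a plain formula such as 'C6HCl5' into {'C': 6, 'H': 1, 'Cl': 5}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def formula_matches(formula, constraints):
    """`constraints` maps element -> (low, high); unlisted elements are unconstrained."""
    counts = parse_formula(formula)
    return all(
        low <= counts.get(element, 0) <= high
        for element, (low, high) in constraints.items()
    )


five_chlorines = formula_matches("C6HCl5", {"Cl": (5, 5)})
three_to_five = formula_matches("C6H4Cl2", {"Cl": (3, 5)})
no_nitrogen = formula_matches("C12H10", {"C": (10, 14), "N": (0, 0)})
print(five_chlorines, three_to_five, no_nitrogen)
```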

If one's purpose is to determine only the presence or absence in a data base of a specific structure, this can be accomplished with the search option "IDENT." This program hash-encodes the query structure connection table and searches through a file of hash-encoded connection tables for an exact match. The search, which is very fast by substructure search standards, has been designed specifically for those users who, to comply with the Toxic Substances Control Act (15), have to determine the presence or absence of specific compounds in Environmental Protection Agency files.
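The principle of an identity search by hashing can be sketched as below. Two assumptions are made loudly here: the atoms of every structure are taken to be already numbered in a canonical order (producing such a numbering is a separate problem that this sketch does not address), and SHA-1 from the Python standard library stands in for whatever hash function IDENT actually uses, which the text does not specify. The table layout and function names are likewise hypothetical.

```python
import hashlib


def hash_table(table):
    """Hash a connection table of the form atom -> (element, {neighbour: bond order}).
    Assumes the atom numbering is already canonical."""
    atoms = ",".join(f"{i}{table[i][0]}" for i in sorted(table))
    bonds = ",".join(
        f"{a}-{b}:{order}"
        for a in sorted(table)
        for b, order in sorted(table[a][1].items())
        if a < b  # record each bond once
    )
    return hashlib.sha1(f"{atoms}|{bonds}".encode()).hexdigest()


def build_index(database):
    """Precompute hash -> key for every structure in the file."""
    return {hash_table(table): key for key, table in database.items()}


def ident(index, query_table):
    """Return the key of the identical structure, or None if it is absent."""
    return index.get(hash_table(query_table))


ethanol = {1: ("C", {2: 1}), 2: ("C", {1: 1, 3: 1}), 3: ("O", {2: 1})}
ethylamine = {1: ("C", {2: 1}), 2: ("C", {1: 1, 3: 1}), 3: ("N", {2: 1})}
propane = {1: ("C", {2: 1}), 2: ("C", {1: 1, 3: 1}), 3: ("C", {2: 1})}

index = build_index({"ethanol": ethanol, "ethylamine": ethylamine})
present = ident(index, ethanol)
absent = ident(index, propane)
print(present, absent)
```

A single dictionary lookup replaces the atom-by-atom comparison, which is why such a search is fast; a production system would presumably confirm a hit by full comparison to guard against hash collisions.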

Finally, if one has completed ring probe and fragment probe searches for a specific query structure and is still confronted with a sizeable file of compounds that satisfy the criteria that were nominated, a substructure search through this file may be carried out. This involves an atom-by-atom, bond-by-bond comparison of every structure.

Figure 4. Fragment probe in the substructure search system


Figure 5. Ring probe in the substructure search system

The substructure search system is currently operating on 19 files which are given in Table 1. The whole system is available for general use on the Tymshare computer network. A subscription fee of $150 per year must be paid for use of the system and the only other charges are the connect-time charge and the searching costs which range upwards from $3.00-5.00 for an identity search.

g. Mass Spectrometry Literature Search System. The accumulated files of the Mass Spectrometry Bulletin, a serial publication of the Mass Spectrometry Data Centre, Aldermaston, England, have been made the basis of an on-line search system.

The Bulletin, which since 1967 has collected about 60,000 citations to papers on mass spectrometry, may be searched interactively for all papers by given authors, all papers dealing with one or more specific subjects or with one or more particular elements. In addition, citations dealing with general index terms may also be retrieved. Simple Boolean logic is available, and thus searches may be conducted for papers by Smith and Jones, or Smith but not Jones, and so on. Citations retrieved may be limited to specific publication years, between 1967 and the present. The interactive nature of the search provides great control to the user. One can learn within a few minutes that while there are in the Bulletin 463 papers dealing with mass measurement, for example, and 678 on chemical ionization, only 8 report on mass measurement in chemical ionization mass spectra. Similarly, one can rapidly discover that although there are 532 papers dealing with carbon dioxide, only 1 of these was presented at the 1975 NATO meeting in Biarritz.

No numerical codes are used by the system. A search for a specific subject can be carried out by entering the subject word itself. If the word "mass" is entered, searches for 7 terms (all those containing the fragment "mas", i.e., mass spectra, mass discrimination, mass measurement and so on) are conducted and the user is asked to select the one of interest. In this way, knowledge of the correct subject words or of their correct spelling is not necessary.
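The Boolean combinations and the word-fragment lookup described above can both be sketched with an inverted index, in which each author or subject term maps to the set of citation numbers filed under it. The index contents, the key format and the helper name below are invented for illustration and do not reflect the Bulletin's actual file layout.

```python
# Hypothetical inverted index: term -> set of citation numbers.
index = {
    "author:smith": {1, 2, 5},
    "author:jones": {2, 3},
    "subject:mass measurement": {1, 3, 5},
    "subject:mass spectra": {2, 4},
    "subject:chemical ionization": {4, 5},
}


def terms_containing(index, fragment):
    """List every index term that contains the typed fragment."""
    return sorted(term for term in index if fragment in term)


# Boolean logic maps directly onto set operations.
smith_and_jones = index["author:smith"] & index["author:jones"]
smith_not_jones = index["author:smith"] - index["author:jones"]
both_subjects = (
    index["subject:mass measurement"] & index["subject:chemical ionization"]
)

# The user types a fragment and then picks from the terms offered.
offered = terms_containing(index, "mas")
print(smith_and_jones, smith_not_jones, both_subjects, offered)
```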

The whole search system has been written for use on a PDP-10 computer and is a component of the MSDC-NIH-EPA Mass Spectral Search System. As such, it is accessible via the ADP-Cyphernetics computer network.

h. X-ray Crystal Literature Retrieval System. The data base used in the X-ray crystallographic search system described in (c) above possesses complete literature references to all entries in the file (8). This information has been made the basis of a system for searching the literature pertaining to the X-ray diffraction study of organic molecules.

As in the Mass Spectrometry Bulletin Search System, it is possible to search for papers by a specific author(s), and papers that appeared in given years in given journals may also be retrieved. Additionally, papers may be located on the basis of specific words appearing in their titles. These words may be truncated by the user and so the fragment "ERO" will retrieve papers with the word "STEROID" in their titles or papers whose titles use the word "MEROQUININE." The system generates scratch files from searches, as in the substructure search system, and files can be intersected upon request with "AND" or "NOT" operators. Thus one could, for example, retrieve all papers published in Acta Crystallographica since 1970 by Atkins, excluding specifically those on corticosteroids.

Once a paper of interest has been identified, all the crystallographic information in that paper can be examined because the crystal serial number of the paper can be used in the crystallographic search system to retrieve that information. Alternatively, the CAS registry number of any particular compound can be used to retrieve any data of interest on that compound from other files of the CIS.

TABLE 1. COMPONENT FILES OF THE NIH-EPA SUBSTRUCTURE SEARCH SYSTEM

Cambridge (X-ray) Crystal File.
CPSC Chemicals in Consumer Products.
EPA AEROS SOTDAT File.
EPA Las Vegas Chemical Spill File.
EPA Storage and Retrieval of Air Data.
EPA Pesticide Standards.
EPA STORET Water Data Base.
EPA-FDA Pesticide Repository Standards.
EPA Inactive Ingredient in Pesticides.
EPA Oil and Hazardous Materials File.
EPA Pollutants in Drinking Water.
EPA Pesticides File.
EROICA Thermodynamics Data File.
Merck Index.
NBS Gas Phase Proton Affinities.
NBS Heats of Formation of Gaseous Ions.
NBS Single Crystal File.
NCI-SRI Industrial Chemicals File.
NCI-PHS-149 File of Carcinogens.
NIMH File of Psychotropic Drugs.
NIH-EPA Carbon-13 Nuclear Magnetic Resonance Search System.
NIH-EPA Mass Spectral Search System.
WHO International Non-proprietary Name File of Drugs.

The X-ray literature search system is operating on the ADP-Cyphernetics network as a component of the X-ray crystallographic search system. Searches are not transaction priced but cost under $2.00 on average.

i. Proton Affinity Retrieval Program. With the current high level of interest in chemical ionization mass spectrometry, there is a need for a reliable file of gas phase proton affinities. No data base of this sort has previously been assembled and for these reasons, the task of gathering and evaluating all published gas phase proton affinities has been undertaken by Rosenstock and coworkers at NBS. This file (16), which has about 400 critically evaluated gas phase proton affinities drawn from the open literature, can be searched on the basis of compound type or of the proton affinity value. It will be appended to the MSSS and the bibliographic component will be merged with the Mass Spectrometry Bulletin Search System.

j. NMR Spectrum Analysis Program. Many proton nmr spectra can be satisfactorily analyzed by hand, and such first order analysis is, in these cases, a quite satisfactory way of assigning chemical shifts and coupling constants to the various nuclei involved. In certain cases, however, so-called second order effects become important and as a result, more or fewer spectral lines than are indicated by first order consideration will result. A way to analyze such spectra is to estimate the various coupling constants and chemical shifts and then, using any of a variety of standard computer programs (17), calculate the theoretical spectrum corresponding to these values. The calculated spectrum can be compared to the observed spectrum and a new estimate of the data can be made. In this way, by a series of successive approximations, the correct coupling constants and chemical shifts can be determined.
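The forward calculation at the heart of this procedure can be illustrated with the simplest second order case, two coupled spin-1/2 nuclei (an AB system), for which the line positions and intensities have a well-known closed form. The sketch below is not taken from the programs of reference (17); the function name and the sample values are chosen only for illustration, and at least one of the shift difference and the coupling constant must be non-zero.

```python
import math


def ab_spectrum(nu_a, nu_b, j):
    """Line positions (Hz) and relative intensities of an AB two-spin system.
    nu_a, nu_b are the chemical shifts in Hz; j is the coupling constant in Hz."""
    centre = (nu_a + nu_b) / 2
    d = math.hypot(nu_a - nu_b, j)       # sqrt(shift difference^2 + J^2)
    outer = (d + abs(j)) / 2             # offset of the two outer lines
    inner = (d - abs(j)) / 2             # offset of the two inner lines
    weak = 1 - abs(j) / d                # outer lines lose intensity
    strong = 1 + abs(j) / d              # inner lines gain intensity
    return [
        (centre - outer, weak),
        (centre - inner, strong),
        (centre + inner, strong),
        (centre + outer, weak),
    ]


lines = ab_spectrum(100.0, 130.0, 40.0)
for position, intensity in lines:
    print(f"{position:7.2f} Hz  intensity {intensity:.2f}")
```

With a shift difference of 30 Hz and a coupling of 40 Hz, first order rules would predict four equal lines at 80, 110, 120 and 150 Hz, whereas the calculation above places them at 70, 110, 120 and 160 Hz with weak outer lines. An iterative analysis adjusts the estimated shifts and coupling constant until such a calculated spectrum agrees with the observed one.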

The CIS component GINA (Graphical Interactive NMR Analysis), which is based upon the programs developed by Johannesen et al. (18), permits these operations in real time in an interactive fashion. The program is designed for use with a vector cathode ray tube terminal upon which each new theoretical spectrum can be displayed for comparison by the user with the observed spectrum. The program has been available at NIH for over four years (19) and is currently being exported to a computer network in the private sector. The cost of using the program is not yet well established because it is subject to wide variations.

k. Mathematical Modelling System (MLAB). MLAB is a program set, developed by Knott at NIH (20), which can assimilate a file of experimental data, such as a titration curve, and perform on it any of a wide variety of mathematical operations. Included amongst these are differential and integral calculus and statistical analysis (mean and standard deviation, curve and distribution fitting and linear and non-linear regression analysis). Output data can be presented in any form, but the PDP-10-resident program is especially powerful in the area of graphical output. Data can be displayed in the form of two- or three-dimensional plots which can be viewed and modified on a CRT terminal prior to pen-and-ink plotting.

This program set is now available for general use on the ADP-Cyphernetics network. The cost of using MLAB is based upon the computer resource units used and depends, therefore, upon the type of work that is being done.

l. Isotopic Label Incorporation Determination (LABDET). Radioisotopes are particularly well-suited to labelling studies because they can be very easily detected at very low levels. In recent years, however, there has been increasing concern about the shortcomings of radioisotopes in medical research. Current standards, in fact, take the position that the use of radioisotopes such as carbon-14 in children and women of child-bearing age is precluded. Consequently, it is not possible to study the metabolism of drugs in such patients using radioisotopes, and this leads to some difficulty because it is only in such patient groups that the metabolism of drugs, such as oral contraceptives, is of relevance.

Much effort has gone into studies of the possibilities of carbon-13 as a surrogate for carbon-14, and this type of work applies also to problems involving oxygen and nitrogen, which have stable isotopes but no convenient radioisotopes. Mass spectrometry is the best general method of detection and quantitation of stable isotopes in molecules, but a serious problem involved in its application is that stable isotopes such as carbon-13, nitrogen-15 and oxygen-18 occur naturally as minor components of the natural elements. This is most pronounced in the case of carbon. Naturally-occurring carbon is about 99% C-12 but a small variable amount of all natural carbon is C-13. This creates a "background" against which determinations of isotope levels in labelled compounds must be measured. The purpose of LABDET (21) is to compare the mass spectrum of an unlabelled compound with that of the same compound isotopically enriched. This is usually done using the molecular ion region of the spectra. The program calculates an estimate of the level of incorporation of isotope and then calculates a theoretical spectrum, which is compared with the observed spectrum. The theoretical spectrum is then adjusted and a further comparison is made, and in this way, the program proceeds through a predetermined number of iterations, finally calculating the correlation coefficient between the observed spectrum and the best theoretical spectrum.
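The iterative comparison described above can be illustrated with a deliberately simplified model. The sketch below assumes a single label that shifts the molecular ion cluster up by one mass unit, so that the theoretical spectrum is a mixture of the unlabelled pattern and a shifted copy of it. This is an assumption made for illustration and is not the actual LABDET algorithm (21), which is unpublished; the function names and the peak intensities are invented.

```python
import math

def normalise(pattern):
    """Scale a list of peak intensities so that they sum to 1."""
    total = sum(pattern)
    return [p / total for p in pattern]

def theoretical_pattern(unlabelled, fraction):
    """Mix the unlabelled pattern with a copy shifted up by one mass unit."""
    mixed = []
    for i, intensity in enumerate(unlabelled):
        shifted = unlabelled[i - 1] if i > 0 else 0.0
        mixed.append((1.0 - fraction) * intensity + fraction * shifted)
    return normalise(mixed)

def squared_error(a, b):
    """Sum of squared differences between two patterns of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def correlation(a, b):
    """Pearson correlation coefficient between two patterns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def estimate_incorporation(unlabelled, labelled, iterations=40):
    """Refine the labelled fraction over a fixed number of iterations.

    Each pass narrows the interval [low, high] by comparing the misfit of
    two trial theoretical spectra (a ternary search).
    """
    observed = normalise(labelled)
    low, high = 0.0, 1.0
    for _ in range(iterations):
        third = (high - low) / 3.0
        left, right = low + third, high - third
        err_left = squared_error(theoretical_pattern(unlabelled, left), observed)
        err_right = squared_error(theoretical_pattern(unlabelled, right), observed)
        if err_left < err_right:
            high = right
        else:
            low = left
    fraction = 0.5 * (low + high)
    return fraction, correlation(theoretical_pattern(unlabelled, fraction), observed)

if __name__ == "__main__":
    # Invented intensities for M, M+1, M+2, M+3; the last slot leaves room for the shift.
    unlabelled = [100.0, 11.0, 0.6, 0.0]
    labelled = [70.0, 37.7, 3.72, 0.18]
    fraction, r = estimate_incorporation(unlabelled, labelled)
    print(f"estimated incorporation {fraction:.3f}, correlation {r:.6f}")
```

Both patterns must cover the same mass range, and the window should extend one mass unit beyond the last significant unlabelled peak so that the shifted copy is not truncated.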

This calculation is not difficult so much as tedious, and if one must carry it out many times per day, use of the computer is indicated. LABDET is an option within MSSS on the ADP-Cyphernetics network and its use costs $2.00.

m. Conformational Analysis of Molecules in Solution. A problem of long standing in chemistry has been to estimate the relationship between the conformation of a molecule in the crystal, as measured by X-ray methods, and that in solution, where barriers to rotation are greatly reduced. A sophisticated program set for Conformational Analysis of Molecules in Solution by Empirical and Quantum-mechanical methods (CAMSEQ) has been developed for this purpose by Hopfinger and coworkers (22) at Case Western Reserve University.

This program can run in batch or interactively. As input data, it requires the structure of the compound. This can be provided as a set of coordinate data from X-ray measurements, it can be entered interactively in the form of a connection table, or the program can simply be provided with a CAS registry number, in which case it will use the corresponding connection table if that is in the files of the CIS.

The first task is to generate the coordinate data corresponding to a particular compound. Then the free energy of this conformation in solution is calculated. Next, the program begins to change torsion angles specified by the user in the conformation, and with each new conformation a statistical thermodynamic probability is calculated, based upon potential (steric, electrostatic, and torsional) functions and terms for the free energy associated with hydrogen-bonding, molecule-solvent, and molecule-dipole interactions. This program, in its interactive version, can be run in under 40K words of core, and work is in progress to export it to a commercial networked computer.
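The scan over torsion angles, with a statistical thermodynamic probability assigned to each conformation, can be sketched as follows. The energy function here is a hypothetical threefold torsional term chosen only for illustration; CAMSEQ itself combines steric, electrostatic, torsional, and solvent-related terms, none of which are reproduced here. The barrier height, step size, and function names are assumptions.

```python
import math

GAS_CONSTANT = 1.987e-3  # kcal / (mol K)

def torsional_energy(angle_deg, barrier=3.0, periodicity=3):
    """Hypothetical torsional term (V/2)(1 + cos(n * phi)) in kcal/mol."""
    phi = math.radians(angle_deg)
    return 0.5 * barrier * (1.0 + math.cos(periodicity * phi))

def scan_torsion(energy, step_deg=30, temperature=298.15):
    """Step one torsion angle through 360 degrees and weight each conformation.

    Returns a list of (angle_deg, energy, probability) tuples, where the
    probability is the normalised Boltzmann factor of that conformation.
    """
    angles = list(range(0, 360, step_deg))
    energies = [energy(a) for a in angles]
    lowest = min(energies)
    # Subtracting the lowest energy keeps the exponentials well scaled.
    weights = [math.exp(-(e - lowest) / (GAS_CONSTANT * temperature))
               for e in energies]
    total = sum(weights)
    return [(a, e, w / total) for a, e, w in zip(angles, energies, weights)]

if __name__ == "__main__":
    for angle, e, p in scan_torsion(torsional_energy):
        print(f"{angle:4d} deg  E = {e:5.2f} kcal/mol  p = {p:.4f}")
```

With this toy potential the staggered conformations at 60, 180, and 300 degrees carry almost all of the probability, while the eclipsed conformations are strongly disfavoured.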

Conclusions

One of the first goals of the CIS was to produce a series of searchable chemical data bases for use by working analytical chemists with no especial computer expertise. A second aim was to link these data bases together so that the user need not be restricted to a consideration of, for example, only mass spectral data.

The various problems inherent in these plans included acquisition of data bases, design of programs, dissemination of the resulting system and linking, via CAS registration numbers, of the various CIS components. These problems, as has been described above, have been solved conceptually and, to a large extent, practically, and the CIS, as it now stands, is the result. It is now possible, therefore, to review the system in an effort to define future goals, and a number of these seem fairly clear.

Searches through more than one data base in combination would be desirable. For example, one often possesses mass spectral and nmr data for an unknown, and it would be very useful to be able to identify any compounds that match these data in a single search. In another development, it is expected that the CONGEN program developed for the DENDRAL project (23) will be merged into CIS during the coming year. This program, which generates structures corresponding to a specific empirical formula, could be extremely useful in a strategy for structure solving using the CIS. It is not at all difficult to envisage situations in which a reduced set of structures could be produced by CONGEN for consideration. Each structure in turn could be used as an input in the substructure search system, and the various compounds whose registry numbers are so retrieved could be considered to be possible answers to the problem. Confirmation for any of them could then be sought in the spectral data bases, the registry number being all that is necessary to locate and retrieve data. One can even speculate further to the day when synthetic pathways to any likely candidates could be designed by the computer system, which could easily add the very practical touch of checking that any starting materials for such syntheses are commercially available at an appropriately low cost!

In a different approach, the power of pattern recognition techniques could be assessed within some of the very large files contained in the CIS. This is a very useful exercise because there is little reported work of this sort on large files, and thus we have begun to explore the value of such methods in handling the problem of identification of true unknowns such as water pollutants. Programs designed to test mass spectra for the presence in the compound of oxygen or nitrogen are currently being tested (24), and their utility as prefilters on mass spectral data prior to data base searching will be tested as soon as feasible.

In summary, it is felt that progress to date with the CIS has demonstrated economic feasibility in that a number of relatively stable CIS components have now been in the private sector for some time. The test before us is whether we can capitalize on this to explore the new and exciting possibilities that lie ahead in the area of structure determination by computer.

Literature Cited

1. Heller, S. R., Milne, G. W. A., and Feldmann, R. J. Science (1977) 195, 253.

2. Heller, S. R., Fales, H. M., and Milne, G. W. A. Org. Mass Spectrom. (1973) 7, 107; Heller, S. R., Koniver, D. A., Fales, H. M., and Milne, G. W. A. Anal. Chem. (1974) 46, 947; Heller, S. R., Feldmann, R. J., Fales, H. M., and Milne, G. W. A. J. Chem. Soc. (1973) 13, 130; Heller, R. S., Milne, G. W. A., and Heller, S. R. J. Chem. Inf. Comp. Sci. (1976) 16, 176.

3. This data base (ref. 8) is leased by NIH on behalf of the entire U. S. scientific community.

4. McCarthy, G. and Johnson, G. G., paper C3 presented as a part of the Proceedings of the American Crystallographic Association meeting, State College, PA, 1974.

5. Heller, S. R., Milne, G. W. A., and Feldmann, R. J. J. Chem. Inf. Comp. Sci. (1976) 16, 132.

6. This is carried out using an unpublished program developed by McLafferty and co-workers at Cornell University.

7. Schwarzenbach, R., Milne, J., Koenitzer, H., and Clerc, J. T. Org. Magn. Res. (1976) 8, 11.

8. Kennard, O., Watson, D. G., and Town, W. G. J. Chem. Doc. (1972) 12, 14.

9. These data are available as NBS tape #9 through the National Technical Information Service, Springfield, VA 22151.

10. Hanawalt, J. D., Rinn, H. W., and Frevel, L. K. Ind. Eng. Chem. (Anal.) (1938) 10, 457.

11. This file is a proprietary product of the Joint Committee on Powder Diffraction Standards, 1601 Park Lane, Swarthmore, PA 19801.

12. Abramson, F. P. Anal. Chem. (1975) 47, 45.

13. Chemical Abstracts Service Standard Distribution Format File, 1976. Chemical Abstracts Service, Columbus, OH 43210.

14. Feldmann, R. J., Milne, G. W. A., Heller, S. R., Fein, A., Miller, J. A., and Koch, B. A. J. Chem. Inf. Comp. Sci. (1977) in press.

15. Toxic Substances Control Act, Public Law 94-469, enacted October, 1976.

16. Hartmann, K., Lias, S., Ausloos, P. J., and Rosenstock, H. M. Publication NBSIR 76-1061, July, 1976.

17. Castellano, S. and Bothner-By, A. A. J. Chem. Phys. (1964) 41, 3863; Swalen, J. D. and Reilly, C. A. J. Chem. Phys. (1965) 42, 440.

18. Johannesen, R. B., Ferretti, J. A., and Harris, R. K. J. Magn. Res. (1970) 3, 84.

19. Heller, S. R. and Jacobson, A. E. Anal. Chem. (1972) 44, 2219.

20. Knott, G. D., and Shrager, R. I. Assn. Comp. Machin. SIGGRAPH Notes 6 (1972) 138.

21. Hammer, C. F., Department of Chemistry, Georgetown University, Washington, DC, unpublished work.

22. Weintraub, H. J. R., and Hopfinger, A. J. Int'l. J. Quant. Chem. (1975) 9, 203.

23. Carhart, R. E., Smith, D. H., Brown, H., and Djerassi, C. J. Amer. Chem. Soc. (1975) 97, 5755.

24. Meisel, W., Jolley, M., and Heller, S. R., in preparation.