George W. A. Milne 
National Institutes of Health,
Bethesda, MD 20014
M. L. Melley and 
Stephen R. Heller
Environmental Protection Agency
Washington, DC 20460
INTRODUCTION
A major activity in modern chemistry is the identification of chemical substances from laboratory measurements made on these substances. Whatever measurement technique is used, the task generally devolves into one of recognizing the 'fingerprint' given by the unknown, when thousands of fingerprints of known compounds are available. The NIH/EPA Chemical Information System permits 'fingerprint recognition' in a variety of efficient and inexpensive ways and is used very heavily in this manner by scientists all over the world.
A much more challenging task now dominating CIS development is the prediction of a substance's properties and behavior from its molecular structure. The short-term promise of such predictive ability is a tremendous saving in resources; large numbers of expensive and time-consuming laboratory measurements can be obviated by strategies in which the properties of all substances in a set can be predicted from selected experimental measurements.
Finally, as a long-term goal of the CIS development effort, an understanding of the relationships between structure and properties is beginning to flow from studies facilitated by access to the large, evaluated, high-quality numeric or source data bases contained in the CIS.
This presentation will center on the first of the above capabilities of the CIS, namely the use of the CIS spectral data bases.
The quantity of data associated with chemistry has been expanding in recent decades, but until the recent advent of third generation computers (integrated circuitry computers), handling and using this vast amount of information was an insuperable problem. With modern computer technology and electronics, the cost of computation has come down, while access to computers has increased through the use of computer networks, accessible over standard telephone lines. With this as background, we have been developing a highly interactive, disk-oriented chemical information system of numerical data. This system is readily and inexpensively available to our own Agencies' laboratories, and our contractors, grantees, and scientific collaborators as well as the general public.
The early computer search systems minimized the high cost of mass storage by maintaining data files on magnetic tape rather than drums or disks. Data bases can be stored very inexpensively on tape, but can only be searched sequentially and this is inevitably a slow process. Magnetic disks, on the other hand, are random access devices, capable of storing a great deal of data which can be accessed and searched very rapidly. Until recently, the cost of disks, controllers and the other items necessary to use such equipment has precluded their use for large data bases. These costs have decreased markedly in recent years, however, and since the use of disk for data storage permits interactive computing, the Chemical Information System uses disk exclusively for the storage of data.
Interactive computing is a significantly different process from batch, off-line, or even on-line computing and a different philosophy can be used in the design of programs for such work. A major problem in searching a chemical data base is that the best questions are often not known. An interactive system can provide the answer to a question immediately and this will enable the user to see the deficiencies in the question and to frame a new query. In this way, there can be built a feedback loop in which the scientist acts as a transducer, "tuning" the query until the system reports precisely what is required. The NIH/EPA Chemical Information System, described here in some detail, has been designed around this general approach.
SYSTEM DESIGN
The NIH/EPA CIS consists of a collection of chemical data bases together with a battery of computer programs for interactive searching through these disk-stored data bases. In addition the CIS has a data referral capability as well as a data analysis software system. It can be thought of then, as having four main areas:
The numeric data bases, or source data bases, include files of mass spectra [2], carbon-13 nuclear magnetic resonance [3], X-ray diffraction data for crystals [4] and powders [5,6], mammalian acute toxicity data [7], and aquatic toxicity data [8]. There are reference or bibliographic data bases associated directly with the mass spectrometry, X-ray crystallography and nuclear magnetic resonance spectroscopy and these have been included within the CIS [9]. The analytical programs include a family of statistical analysis and mathematical modeling algorithms [10], programs for the second order analysis of NMR spectra [11], and energy minimization of conformational structures [12]. Programs that design chemical syntheses are being tested and may, if viable, become part of the CIS in the future [13].
The center or "hub" of the CIS is the Structure and Nomenclature Search System (SANSS) [14], which allows the user to search through data bases of structures (such as those associated with collections of mass spectra) for occurrences of a specific structure or substructure. With this program, for example, regulatory Agencies considering the problem of collecting data on aromatic bromo-chloro compounds could proceed as follows: The substructure search shown in Figure 1 could be conducted and this would find all occurrences of BrCl compounds in the 40 data bases searched. In turn, by reference to the Toxic Substances Control Act (TSCA), International Trade Commission (ITC), Resource Conservation and Recovery Act (RCRA), etc., this would lead to information such as the number of chemicals involved, the dollar volume of chemicals affected, and so on. If necessary, a subset of these chemicals could be defined and investigated in further detail.
In the area of structure elucidation, if one had evidence that an unknown contained a particular substructure, a search might reveal that there were NMR spectra to compare with such a similar structure, but no IR spectra, suggesting that an NMR spectrum would be more useful than an IR spectrum in attempts to identify the unknown.
As more and more data bases have been collected and merged into the SANSS, it has become a catalog of files that contain certain chemicals [15]. Recently the structure of the SANSS files was reorganized so that this referral capability, which uses an integrated data base of the forty files shown in Figures 2a and 2b, is much more efficient.
The entire CIS structure can be viewed, as shown in Figure 3, as "a wheel" of independent numerical data bases, linked together through the SANSS "hub", using the Chemical Abstracts Service Registry Number (CAS REGN) as the unique universal chemical identifier for each compound. The use of the CAS REGN to tag all CIS files was codified in EPA regulation #2800.2 in 1975 [16]. With the passage of TSCA in late 1976, the use of the CAS REGN was extended to the TSCA inventory and thus a link between regulatory data and scientific data has been established. In Figure 3, the solid circles represent publicly accessible CIS components running on a commercial computer. The solid circles represent systems which are currently being put through their final testing at NIH before they are considered operational and placed in the commercial system. The dotted triangles are systems under development. Finally, the dotted lines to the solid circles refer to operational systems that the CIS can link to, but that are on other computers on the Telenet network [33], which is used by the CIS.
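The wheel-and-hub arrangement amounts to keying every file by the same identifier. The following is a minimal Python sketch of that idea, assuming each component file can be viewed as a mapping from registry number to a record. The file contents, the placeholder registry numbers and the function names are invented for illustration and are not CIS data or CIS commands.

```python
# Hypothetical component files, each keyed by CAS registry number.
# All identifiers and values are placeholders, not real CIS records.
files = {
    "MSSS": {"REGN-1": "mass spectrum A", "REGN-2": "mass spectrum B"},
    "CNMR": {"REGN-2": "CNMR spectrum B"},
    "RTECS": {"REGN-1": "toxicity record A", "REGN-3": "toxicity record C"},
}

def referral(regn):
    """Return the names of all files that hold an entry for this registry number."""
    return sorted(name for name, records in files.items() if regn in records)

def collect(regn):
    """Gather every record for one compound across the files (the 'hub' lookup)."""
    return {name: records[regn] for name, records in files.items() if regn in records}

print(referral("REGN-1"))
print(collect("REGN-2"))
```

Because every file shares one key, adding a new data base to the wheel requires no change to the lookup itself.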
CIS SYSTEM DEVELOPMENT
A general protocol for updating of CIS components or the addition to the CIS of new components has been established and a schematic diagram of this protocol is shown in Figure 4.
In the first phase, a data base is acquired from one of a variety of sources. Some of the CIS data bases have been developed specifically for the CIS, an example of this being the mass spectral data base [2]. Others, such as the Cambridge Crystal File [4], are leased for use in the CIS and others, such as the X-ray powder diffraction files, are operated within the CIS by their owners, in this case the JCPDS - International Centre for Diffraction Data.
In other cases, the information comes from other Government Agencies which retain responsibility for the file, its contents and its maintenance. An example of such a file is the NIOSH RTECS.
If the data base is to be made searchable, some reformatting, sorting and inversion of files is usually required and this is carried out on the NIH IBM 360-168, which is well-suited to processing large files of data. Once inverted lists have been prepared, they are transferred to the NIH PDP-10 computer, which is primarily a time-sharing computer, and the programs for generating the searchable files and for searching through these files are written. Analytical, data base-independent programs of the CIS are usually written entirely on the PDP-10.
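File inversion, as the term is used above, turns a list of records into lists of record numbers indexed by field value, so that a search becomes a lookup instead of a sequential scan. A minimal sketch in Python follows; the record layout and the values are hypothetical and chosen only to show the mechanics.

```python
from collections import defaultdict

# Hypothetical records: (entry number, molecular weight, space group).
records = [
    (1, 272, "P1"),
    (2, 180, "P21"),
    (3, 272, "P21"),
]

def invert(records, field_index):
    """Build an inverted list: field value -> sorted list of entry numbers."""
    inverted = defaultdict(list)
    for record in records:
        inverted[record[field_index]].append(record[0])
    return {value: sorted(entries) for value, entries in inverted.items()}

by_weight = invert(records, 1)
by_space_group = invert(records, 2)
print(by_weight[272])
print(by_space_group["P21"])
```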
Out of this work, there finally emerges a pilot version of each CIS component. This pilot version is then allowed to run on the NIH PDP-10 and access to it is provided to a small number of people who can log into the NIH computer by telephone, using long distance calls if necessary. These users are provided with free computation and in return, they test the component thoroughly for errors and deficiencies. Such problems are reported to the development team, which attempts to deal with them. Depending upon the size and complexity of the component, this testing phase can last as long as eighteen months.
When testing is complete, the entire component is exported to a networked PDP-10 in the private sector and the version on the NIH computer is no longer maintained. The component in the private sector is available to the general scientific community, including Government Agencies, and is used on a fee-for-service basis. In this phase, the Government retains no financial interest in the component; it is 'managed' by a sponsor outside the U.S. Government. The Netherlands Information Combine in Holland, for example, maintains the Carbon-13 Nuclear Magnetic Resonance (CNMR) search system on the network. The U.S. Government does not subsidize the routine operation of CIS components in the private sector. In fact, various agencies of the Government are actually users of the CIS and they pay, like any other user, according to their use of the system. Charges for use of these components must be designed to cover costs, and if the component attracts insufficient use at these prices, then it may not be viable and its sponsor is free to discontinue its support.
COMPUTER FACILITIES USED BY THE CIS.
Programs of the CIS have usually been designed for use with a DEC PDP-10 computer system. The reason for this is that the PDP-10 is one of the better time-sharing systems available and has been adopted by a number of commercial computer companies as the main vehicle for their network operations. Transfer of a program from the NIH PDP-10 to a network PDP-10 is usually fairly straightforward, and use of a networked computer is favored because the alternative philosophy of exporting programs and data bases to locally operated PDP-10 computers is less workable and contains a number of deficiencies that are overcome by a network. Most important among these is the fact that use of a networked machine means that data bases need be stored only once, at the center of the network. A great deal of money is thus saved because duplicate storage is not necessary. Further, a single copy of a data base is easy to maintain, whereas updating of a data base that resides on many computers is virtually impossible. Finally, communication between systems personnel and users is very simple in a network environment, as is monitoring of system performance.
For these and other reasons, the policy of disseminating the CIS via a networked PDP-10 computer was adopted at the outset and has proved to be quite successful. A typical American network of this sort has something under 100 nodes - i.e. local telephone call access is available in about 100 locations. These are mainly in the U.S., but a substantial number may be found in Europe. Further, some computer networks are now themselves interfaced to the Telex network, thus making their computer systems available worldwide. Irrespective of one's location, the cost of access to the CIS is either $36 or $60 per hour (depending upon which component is chosen) in the USA and Canada. Outside North America, the cost is $24 or $48 per hour plus the local telephone company/PTT charges for connection to the Telenet network. These latter costs vary from about $15 to $30 per hour, depending on the country and possibly the location within a particular country.
The only equipment that is required to establish access to a computer network is a telephone-coupled computer terminal. Typewriter terminals are becoming very common and are also becoming relatively inexpensive. Such a terminal can be purchased from a variety of manufacturers for between $1,000 and $3,000 and in general, will operate at 300 baud (30 characters/second). A cathode ray terminal, capable of running at 1200 baud, can be purchased for as little as $2,000. Any equipment of this sort can usually be leased or purchased.
COMPONENTS OF THE CIS.
a. Mass Spectral Search System (MSSS)
The Mass Spectral Search System (MSSS) is the oldest component of the CIS. Developed as a joint effort between NIH, EPA, NBS (National Bureau of Standards) and the Mass Spectrometry Data Centre (MSDC) in England, the current MSSS data base contains about 31,600 mass spectra representing the same number of compounds. This has been derived from an archival file containing some 60,000 spectra of the same 31,600 compounds [17]. Computer techniques have been employed to assign every spectrum a quality index [18] and where duplicate spectra appear in the archive file, only the best spectrum is used in the working file. All compounds in the archive have been assigned a Chemical Abstracts Service (CAS) registry number, a unique identifier that is used to locate duplicate entries for the compounds, find the compound in other CIS files and provide structure and synonym lookup capabilities throughout the CIS.
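The reduction of the archive to a working file can be pictured as a grouping by registry number followed by selection of the best-rated spectrum in each group. The sketch below assumes a quality index for which higher values are better; the actual scale of the index [18] is not given here, and the rows are invented.

```python
# Hypothetical archive rows: (registry number, spectrum id, quality index).
# The scale of the quality index is assumed: higher is better.
archive = [
    ("REGN-1", "S001", 0.62),
    ("REGN-1", "S002", 0.91),
    ("REGN-2", "S003", 0.75),
]

def build_working_file(archive):
    """Keep, for each registry number, only the spectrum with the highest quality index."""
    best = {}
    for regn, spectrum_id, quality in archive:
        if regn not in best or quality > best[regn][1]:
            best[regn] = (spectrum_id, quality)
    return {regn: spectrum_id for regn, (spectrum_id, _) in best.items()}

working = build_working_file(archive)
print(working)
```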
Searches through the MSSS data base can be carried out in a number of ways. With the mass spectrum of an unknown in hand, the search can be conducted interactively, as is shown in Figure 5. In this search the user finds that 89 data base spectra have a peak (minimum intensity 60%, maximum intensity 100%) at an m/e value of 272. When this subset is examined for spectra containing a peak at m/e 237 with intensity of between 10 and 70%, only 6 spectra are found. The entering of a third peak, at an m/e value of 357 (with an intensity between 5% and 30%), narrows the search down to just one answer, which is then printed out. In the example shown the answer, Kepone, is shown with a number of synonyms used in naming this chemical, as well as other identifying information. If there still had been a large number of answers after entering the three peaks used in this example, the search could have been reduced further to a manageable number of spectra by entering further peaks. In addition, the data base can be examined for all occurrences of a specific molecular weight or a partial or complete molecular formula. Combinations of these properties can also be used in searches. Thus all compounds containing, for example, five chlorines and whose mass spectra have a base peak at a particular m/e value can be identified.
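The interactive peak search is a process of successive narrowing: each peak entered is applied only to the spectra that survived the previous entries. The following Python sketch illustrates this with a toy data base of four invented spectra; the data structure and the function name `peak_search` are assumptions, not the MSSS implementation.

```python
# Hypothetical data base: spectrum name -> {m/e: relative intensity in %}.
spectra = {
    "compound A": {272: 100, 237: 45, 357: 20},
    "compound B": {272: 80, 237: 5},
    "compound C": {272: 65, 237: 30, 143: 100},
    "compound D": {91: 100, 272: 10},
}

def peak_search(candidates, mz, min_intensity, max_intensity):
    """Keep candidates that have a peak at mz within the given intensity window."""
    return [
        name for name in candidates
        if min_intensity <= spectra[name].get(mz, -1) <= max_intensity
    ]

hits = list(spectra)  # start with the whole file
for mz, low, high in [(272, 60, 100), (237, 10, 70), (357, 5, 30)]:
    hits = peak_search(hits, mz, low, high)
    print(f"m/e {mz}: {len(hits)} spectra remain")
print(hits)
```

The user remains free to stop after any step and list the surviving spectra, which is the feedback loop on which the interactive design depends.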
In contrast to these interactive searches, which are of little appeal to those with large numbers of searches to carry out, there are available two batch-type searches which accept the complete spectrum of the unknown and examine all spectra in the file sequentially to find the best fits. These are the Biemann and PBM search algorithms. Spectra can be entered from a teletype, but in a more common arrangement, a user's data system can be connected to the network for this purpose and the unknown spectra can be down-loaded into the network computer for use in the search.
Once an identification has been made, the name and registry number of the compound found are reported to the user. If necessary, the data base spectrum can be listed or, if a CRT terminal is being used, plotted, to facilitate direct comparison of the unknown and standard spectra.
Also within the MSSS are the accumulated files of the Mass Spectrometry Bulletin, a serial publication of the Mass Spectrometry Data Centre, UKCIS, Nottingham, England. The Bulletin, which since 1967 has collected about 60,000 citations to papers on mass spectrometry, may be searched interactively for all papers by given authors, all papers dealing with one or more specific subjects or with one or more particular elements [27]. In addition, citations dealing with general index terms may also be retrieved. Simple Boolean logic is available, and thus searches may be conducted for papers by Smith and Jones, or Smith but not Jones, and so on. Citations retrieved may be limited to specified publication years, between 1967 and the present. The interactive nature of the search provides great control to the user. One can learn within a few minutes that while there are in the Bulletin 463 papers dealing with mass measurement, for example, and 678 on chemical ionization, only 8 report on mass measurement in chemical ionization mass spectra. Similarly, one can rapidly discover that although there are 532 papers dealing with carbon dioxide, only 1 of these was presented at the 1975 NATO meeting in Biarritz.
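The Boolean combinations described here correspond directly to set operations on lists of citation numbers. A short sketch in Python, with an invented citation index, shows "Smith and Jones", "Smith but not Jones" and a restriction by publication year.

```python
# Hypothetical citation index: author -> set of citation numbers.
by_author = {
    "Smith": {1, 2, 3, 5},
    "Jones": {2, 5, 8},
}
# Hypothetical index: publication year -> set of citation numbers.
by_year = {1974: {1, 2}, 1975: {3, 5, 8}}

smith_and_jones = by_author["Smith"] & by_author["Jones"]   # AND
smith_not_jones = by_author["Smith"] - by_author["Jones"]   # NOT
smith_in_1975 = by_author["Smith"] & by_year[1975]          # year limit

print(sorted(smith_and_jones))
print(sorted(smith_not_jones))
print(sorted(smith_in_1975))
```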
No numerical codes are used by the system. A search for a specific subject can be carried out by entering the subject word itself. If the word 'mass' is entered, searches for 7 terms (all those containing the fragment 'mass', i.e. mass spectra, mass discrimination, mass measurement, etc.) are conducted and the user is asked to select the one of interest. In this way, knowledge of the correct subject words or of their correct spelling is not necessary.
With the current high level of interest in chemical ionization mass spectrometry, there is a need for a reliable file of gas phase proton affinities. No data base of this sort has previously been assembled and for these reasons, the task of gathering and evaluating all published gas phase proton affinities has been undertaken by Rosenstock and co-workers at NBS. This file [28], which has about 400 critically evaluated gas phase proton affinities drawn from the open literature, can be searched on the basis of compound type or the proton affinity value.
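Matching on a fragment of a subject word can be sketched as a substring test against the list of index terms, after which the matching terms are offered to the user as a numbered menu. The terms below are illustrative only and the function name is hypothetical.

```python
# Hypothetical subject index terms.
index_terms = [
    "mass spectra",
    "mass discrimination",
    "mass measurement",
    "chemical ionization",
    "biomass",
]

def matching_terms(fragment):
    """Return every index term that contains the fragment, ignoring case."""
    fragment = fragment.lower()
    return [term for term in index_terms if fragment in term.lower()]

choices = matching_terms("mass")
for number, term in enumerate(choices, start=1):
    print(number, term)
```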
The MSSS has been widely available through computer networks since 1971 and is currently resident upon the Interactive Sciences Corporation (ISC) computer where, every month, over 3,000 searches and 2,000 other transactions, such as retrievals, are carried out by over 300 laboratories. The charge for use of the MSSS is fixed at $36 per hour, in addition to a yearly $300 CIS subscription fee, which allows users access to all CIS components. The $36 per hour charge translates to about $0.50 to $0.60 per transaction within the MSSS.
b. Carbon-13 Nuclear Magnetic Resonance (CNMR) Spectral Search System
The data base that is used in the CNMR search system consists currently of 6,700 CNMR spectra. As in the case of the MSSS, every compound has a CAS registry number, and all duplicative spectra have been removed from the file. A specific compound may still appear in this file more than once, however, because its CNMR spectrum may have been recorded, for example, in different solvents. The CNMR file is still small but is growing at a fairly steady rate and should benefit considerably from recent international agreements to the effect that all major compilations of CNMR data will, in the future, be pooled [32].
Searching through this data base, as in the case of the MSSS, can be interactive or not [3]. In the interactive search, a user enters a shift, with an acceptable deviation, and the single frequency off-resonance decoupled multiplicity, if that is known. The algorithm reports the number of file spectra fitting one or both of the criteria entered. The names of the compounds whose spectra have been retrieved can be listed, or alternatively, the list can be reduced by the entry of a second chemical shift. A search for spectra of compounds having a specific molecular formula can also be carried out, but there is no capability for searching on molecular weight, a parameter of little relevance to CNMR spectroscopy.
If an interactive search is not appropriate to the problem at hand, a batch type of search through the data base is available using the techniques described by Clerc et al. [19]. To institute such a search, the user enters all the chemical shifts from the unknown and starts the search. The entire unknown spectrum is compared to every entry in the file and the best fits are noted and reported to the user. This program searches for the absence of peaks in a given region as well as for the presence of peaks and thus has the capability of finding those compounds which are structurally similar to the material that gave the unknown spectrum.
When a search is completed, the user is provided with the accession numbers of spectra that match the input data. The names and CAS registry numbers of the compounds in question will also be given. If more information is required, the complete entry for a given accession number can be retrieved. This includes a numbered structural formula, the name, molecular formula and registry number of the compound, experimental data pertaining to the spectrum and the entire spectrum, together with single frequency off-resonance decoupled multiplicities and, when available, the relative line intensities and assignments.
This CNMR search system is available as part of the CIS at the rate of $36 per hour, in addition to the $300 yearly CIS subscription fee.
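An interactive shift search of this kind can be sketched as a tolerance test on each line of each file spectrum, optionally combined with a multiplicity test. In the Python sketch below the spectra, the multiplicity codes and the function name are assumptions made for illustration; the sketch also requires both criteria when a multiplicity is supplied, which is only one of the behaviours described above.

```python
# Hypothetical file: name -> list of (chemical shift in ppm, multiplicity).
# Multiplicity codes are assumed: "s", "d", "t", "q".
cnmr_file = {
    "compound A": [(128.5, "d"), (21.3, "q")],
    "compound B": [(128.9, "s"), (170.2, "s")],
    "compound C": [(77.0, "d"), (21.0, "q")],
}

def shift_search(candidates, shift, deviation, multiplicity=None):
    """Keep candidates with a line within `deviation` of `shift`;
    if a multiplicity is given, that same line must also carry it."""
    hits = []
    for name in candidates:
        for line_shift, line_mult in cnmr_file[name]:
            close = abs(line_shift - shift) <= deviation
            if close and (multiplicity is None or line_mult == multiplicity):
                hits.append(name)
                break
    return hits

first = shift_search(list(cnmr_file), 128.7, 0.5)
second = shift_search(first, 21.2, 0.3, multiplicity="q")
print(first)
print(second)
```

As in the mass spectral search, the second shift is applied only to the answers retained from the first.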
c. X-ray Crystallographic (CRYST) Search System
This is a series of search programs working against the Cambridge Crystal File [4], a data base of some 15,000 compounds for which full atomic co-ordinate data are available, and over 27,000 bibliographic entries dealing with published crystallographic data, mainly for organic compounds. The entry for each compound contains the compound name, its molecular weight and CAS Registry number, the space group in which it crystallizes and the parameters of the unit cell of the crystal as well as the atomic co-ordinate data. The file may be searched on the basis of any of these parameters as shown in Figure 6, which shows a search for any compounds that crystallize in space group P 1 and have molecular weights between 250 and 300. As can be seen, there are 133 entries with the correct space group (temporary file 1) and 2038 with molecular weight between 250 and 300 (temporary file 2). The intersection of these files reveals that only 21 compounds (temporary file 3) meet both specifications, and the first of these compounds, crystal sequence number 849, is listed.
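The use of numbered temporary files can be sketched as follows: each search stores its answers as a set under the next file number, and an intersection of two such sets is stored as a further file. The entries, sequence numbers and the helper `save` are hypothetical.

```python
# Hypothetical entries: crystal sequence number -> (space group, molecular weight).
crystals = {
    101: ("P1", 262.3),
    102: ("P21/c", 271.0),
    103: ("P1", 410.5),
    104: ("P1", 299.9),
}

temporary_files = {}

def save(result):
    """Store a search result as the next numbered temporary file."""
    number = len(temporary_files) + 1
    temporary_files[number] = set(result)
    return number

f1 = save(n for n, (sg, _) in crystals.items() if sg == "P1")
f2 = save(n for n, (_, mw) in crystals.items() if 250 <= mw <= 300)
f3 = save(temporary_files[f1] & temporary_files[f2])  # intersection of files 1 and 2

print(f3, sorted(temporary_files[f3]))
```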
All the compounds in this file have been registered by the CAS and the CAS data is currently being merged into the CRYST system. This data base is therefore searchable on a structural or substructural basis, as are all the other files of the CIS.
Once an entry of interest in the Cambridge X-ray file has been located by one of the search programs, its 'crystal sequence number' can be used to retrieve the appropriate literature reference, structure, or co-ordinate data or both.
The data base used in the X-ray crystallographic search system described in (c) above possesses complete literature references to all entries in the file [4]. This information has been made the basis of a system for searching the literature pertaining to the X-ray diffraction study of organic molecules.
As in the Mass Spectrometry Bulletin Search System, it is possible to search for papers by a specific author or authors, and papers that appeared in given years in given journals may also be retrieved. Additionally, papers may be located on the basis of specific words appearing in their titles. These words may be truncated by the user and so the fragment 'ERO' will retrieve papers with the word 'STEROID' in their titles or papers whose titles use the word 'MEROQUINE'. The system generates temporary files from searches, as in the SANSS, and files can be intersected upon request with 'AND' or 'NOT' operators. Thus one can, for example, retrieve all papers published in Acta Crystallographica since 1970 by Atkins, excluding specifically those on corticosteroids.
Once a paper of interest has been identified, all the crystallographic information in that paper can be examined because the crystal serial number of the paper can be used in the crystallographic search system to retrieve that information. Alternatively, the CAS Registry number of any particular compound can be used to retrieve any data of interest on that compound from other files of the CIS.
As with the other CIS components above, the X-ray data and bibliographic file is available for general use via the ISC computer for $36 per hour, in addition to the CIS subscription fee.
d. X-ray Crystal Data Search System
The National Bureau of Standards (NBS) has collected a file of data pertaining to some 24,000 crystals, including those in the Cambridge file described above [20]. The data in the NBS file include the cell parameters, the number (z) of molecules in the unit cell, the measured and calculated densities of the crystal and two determinative ratios, such as A/B and A/C (but no co-ordinate data). Every compound in the file is identified by its name, molecular formula and CAS Registry number, and the file can be structurally searched by the CIS structure and nomenclature search system as is described below.
Searches through this data base for crystals with specific space groups or densities have been developed and are in the testing phase. It is possible to locate crystals with reduced unit cells calculated from the measured cell dimensions. It is hoped that this may prove to be a very rapid method of identifying compounds from the readily measured crystal properties. The single crystal system is expected to be operational in early 1979.
e. X-ray Powder Diffraction Search Match (PDSM) System
A collection of powder diffraction patterns proves to be a very effective means by which to identify materials and indeed, one of the very earliest search systems in chemical analysis was based upon such data by Hanawalt [21] over forty years ago. The importance of these data in TSCA can be seen by examining the TSCA Inventory regulations for treatment of confidential chemicals [22]. Section 710.7 of these regulations indicates that EPA intends to rely on powder diffraction data to assure the validity and seriousness of a manufacturer's request for treating information on a chemical as confidential.
The data base of some 27,000 powder diffraction patterns that is used in the CIS [5] is a direct descendant of that with which Hanawalt carried out his pioneering work. A problem that arises in connection with this particular component stems from the fact that powders are frequently mixtures and so the patterns that are obtained experimentally are often combinations of one or more file entries. A reverse searching program, which examines the experimental data to see if each entry from the file is contained in it, has been written after the general approach of Abramson [23], and seems to cope with this particular difficulty. It was released in December, 1978 for general use as a CIS component. The cost of using PDSM is $60 per hour in addition to the $300 yearly CIS subscription fee, and the program is now being used frequently.
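The principle of a reverse search can be sketched as follows: rather than asking whether the unknown pattern matches a file entry, one asks whether every line of a file entry can be found in the unknown pattern, so that several components of a mixture can be reported together. The Python below is a simplified illustration of that principle only and is not the method of Abramson [23]; the patterns, the tolerance and the function names are assumptions, and line intensities are ignored.

```python
# Hypothetical reference patterns: name -> d-spacings (angstroms) of the
# strongest lines. Illustrative values only, not taken from the JCPDS file.
reference_patterns = {
    "phase A": [3.34, 4.26, 1.82],
    "phase B": [3.03, 2.28, 2.09],
    "phase C": [2.82, 1.99, 1.63],
}

def contains(observed, reference, tolerance=0.02):
    """True if every reference line has an observed line within the tolerance."""
    return all(
        any(abs(obs - ref) <= tolerance for obs in observed)
        for ref in reference
    )

def reverse_search(observed, tolerance=0.02):
    """Test each file entry against the unknown, not the unknown against the file."""
    return [
        name for name, lines in reference_patterns.items()
        if contains(observed, lines, tolerance)
    ]

# Observed pattern of a hypothetical two-component mixture.
mixture = [4.27, 3.35, 3.03, 2.29, 2.09, 1.82]
print(reverse_search(mixture))
```

Extra lines in the unknown do not count against a file entry, which is what allows each component of a mixture to be found independently.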
f. NIOSH RTECS Search System
The National Institute for Occupational Safety and Health (NIOSH), created in 1970, is required by law to prepare a list containing all the toxic effects of chemicals that can be found to have been recorded [24]. The Registry of Toxic Effects of Chemical Substances (RTECS) is the data base created and updated annually by NIOSH to comply with this law. In 1977 the data base consisted of some 25,000 chemicals and the toxicity associated with each of these chemicals.
The NIOSH RTECS is the first non-spectroscopic CIS data base and has proven to be a very valuable addition to the CIS. Interest in the data base has been shown by many groups within EPA involved in the implementation of TSCA, and as an example, work is now underway to link spectral data with the NIOSH toxicity data so that as a result of a mass spectral identification, the EPA laboratory involved can quickly be informed if the chemical identified is toxic and hence requires immediate action.
The RTECS data base can be searched in a number of ways, including NIOSH number, CAS Registry number, type of animal tested, route of dosage, LD50, LC50, etc. The NIOSH RTECS file is also linked to the SANSS so that structure-activity correlation work can be performed.
An example of a NIOSH RTECS search is shown in Figure 7. In this example, a search is being performed for the three oral rodent LD50 toxicity entries with values less than 75 ug/kg. At the bottom of Figure 7 are listed these three references, with the NIOSH number, toxicity data and the literature citation for each measurement.
g. Structure and Nomenclature Search System (SANSS)
All the compounds in the files of the CIS have been assigned a registry number by the CAS. The registry number is a unique identifier for that compound, and may be used to retrieve from the CAS Master Registry of over 4,000,000 entries all the synonyms that the CAS has identified for the compound, in addition to the name used in the CAS 9th Collective Index. Further, the registry number can be used to locate in the CAS files the connection table for the compound's structure. This is a two-dimensional record of all atoms in the molecule together with the atoms to which each is bonded and the nature of the bonds [25]. The connection table is the basis of the substructure search component of the CIS [25].
The purpose of the SANSS, as can be seen from Figure 8, is to permit a search for a user-defined structure or sub-structure through data bases of the CIS. If a sub-structure is found to be in a CIS data base, then, armed with its CAS Registry number, the user can access that file and locate the compound and hence retrieve whatever data are available for it.
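A connection table of the kind described can be represented very simply as a list of numbered atoms and a list of bonds between them. The layout below is an assumption made for illustration and is not the CAS record format; hydrogens are left implicit and the example is the four heavy atoms of acetic acid.

```python
# A connection table sketch: atoms are numbered nodes, bonds are
# (atom, atom, bond order) triples. Hydrogens are left implicit.
# Example: the heavy-atom skeleton of acetic acid, C1-C2(=O3)-O4.
atoms = {1: "C", 2: "C", 3: "O", 4: "O"}
bonds = [(1, 2, 1), (2, 3, 2), (2, 4, 1)]

def neighbors(atom):
    """List (neighbor atom number, bond order) pairs for one atom."""
    found = []
    for a, b, order in bonds:
        if a == atom:
            found.append((b, order))
        elif b == atom:
            found.append((a, order))
    return sorted(found)

for number, element in atoms.items():
    print(number, element, neighbors(number))
```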
There are a number of ways to search the CIS Unified Data Base. The main ones are:
1. Name Fragment Name Search (NPROBE)
2. Nucleus/Ring Search (RPROBE)
3. Fragment Search (FPROBE)
4. Structure Code Search (SPROBE)
5. Molecular Weight, Molecular Formula, Partial Formula
6. Total Atom-by-atom, Bond-by-bond search (SUBSS) 
7. Total or Full Structure Search (IDENT)
While structure searching is very important and cannot be replaced by other methods, the ability to search for a chemical by name or partial name (NPROBE) is quite useful in many cases. In particular, if one wishes to search for a drug or pesticide, all of which have simple and short trivial names, a name search is very useful, because many biologically important chemicals have complex polycyclic structures, which are difficult to draw. In the example shown in Figure 9, a name search is conducted for the carcinogen Dioxin. The program then is asked (using the SSHOW command) to print out the files in which this one chemical containing the name fragment Dioxin appears, along with its molecular formula, structural diagram and correct Chemical Abstracts Index name, as well as the synonyms associated with the chemical.
As the first step in a structure search, the user must define the substructure of interest to the computer. This is done with a family of structure generation programs which can, for example, create a ring of a given size, a chain of a given length, a fused ring system and so on. Branches, bonds and atoms can be added and the nature of bonds and atoms can be specified. In the absence of a definition, an atom is presumed to be carbon. As the query structure is developed using these commands, the computer stores the growing connection table. If the user wishes to view the current structure at any point, the display command (D) can be invoked. This command, using the current connection table, generates a structure diagram similar to those in Figures 10, 11 and 12. This can be printed at a conventional terminal.
When the appropriate query structure has been generated, a number of search options can be invoked to find occurrences of this query structure in the data base. The two most useful search options are the fragment probe and the ring probe. The fragment probe will search through the assembled connection tables of the data base for all occurrences of a particular atom-centered fragment, i.e. a specific atom, together with all its neighbors and bonds. The user may specify particular fragments which are thought to be fairly unique and characteristic of the query structure. Alternatively, a search for every fragment in the query structure may be requested. A fragment probe is shown in Figure 10. The query structure contains only one relatively unique node, C2, and this is the one which is sought in the data base. It is found to occur 229 times and a temporary file of just these particular entries is stored as file #3. This can be accessed by the user either for the purpose of listing its contents, as is shown in the figure, or to intersect it with other scratch files.
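An atom-centered fragment can be encoded as the central element together with the sorted list of its neighbors and bond orders, after which a fragment probe reduces to a membership test on each entry. The Python sketch below assumes this encoding and a toy data base of three invented structures; it is offered as an illustration of the idea and not as the FPROBE implementation.

```python
# Each structure: (atoms, bonds); atoms map number -> element,
# bonds are (atom, atom, bond order). All entries are hypothetical.
database = {
    "entry 1": ({1: "C", 2: "C", 3: "O", 4: "O"}, [(1, 2, 1), (2, 3, 2), (2, 4, 1)]),
    "entry 2": ({1: "C", 2: "C", 3: "O"}, [(1, 2, 1), (2, 3, 1)]),
    "entry 3": ({1: "C", 2: "C", 3: "O", 4: "N"}, [(1, 2, 1), (2, 3, 2), (2, 4, 1)]),
}

def fragments(atoms, bonds):
    """Return the set of atom-centered fragments of a structure: the central
    element plus the sorted (neighbor element, bond order) pairs around it."""
    around = {number: [] for number in atoms}
    for a, b, order in bonds:
        around[a].append((atoms[b], order))
        around[b].append((atoms[a], order))
    return {(atoms[n], tuple(sorted(pairs))) for n, pairs in around.items()}

def fragment_probe(query_fragment):
    """Find every entry containing the given atom-centered fragment."""
    return [name for name, (atoms, bonds) in database.items()
            if query_fragment in fragments(atoms, bonds)]

# Query: a carbon bonded to C (single), O (single) and O (double).
query = ("C", (("C", 1), ("O", 1), ("O", 2)))
print(fragment_probe(query))
```

In practice the fragment sets would be computed once and stored as inverted lists, so that the probe itself is a lookup.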
The ring probe search is a search for all structures in the data base containing the same ring or rings as the query structure. A ring that is considered to be an answer to such a query must be the same size as that in the query structure. It must also contain at least as many non-carbon atoms (heteroatoms) as the query structure, but the nature of the heteroatoms can be required by the user to be the same as, or allowed to differ from, that in the query structure. The type of bonding is not considered in an RPROBE search. Thus with a query structure of furan, the only 'exact' answer is furan but the user may permit the retrieval of 'imbedded' answers which would include furan, tetrahydrofuran and thiophene. An example of a ring probe search is given in Figure 11. Here the query structure is a 1,4-dichlorofuran, but imbedded matches for heteroatoms and substituents have been allowed and so the list of 304 answers will include any disubstituted pyrroles as well as any disubstituted furans and so on. A higher degree of substitution will also be permitted.
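The two ring-matching rules stated above (equal ring size, and at least as many heteroatoms as the query, with the heteroatom type optionally enforced) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: rings are represented only as tuples of element symbols, the position of atoms around the ring and all substituents are ignored, and the function and option names are hypothetical rather than RPROBE's own.

```python
from collections import Counter


def heteroatoms(ring):
    """Count the non-carbon atoms in a ring given as element symbols."""
    return Counter(atom for atom in ring if atom != "C")


def ring_matches(query, candidate, same_heteroatoms=True):
    """Apply the size and heteroatom rules; bond types are not examined."""
    if len(query) != len(candidate):
        return False
    q, c = heteroatoms(query), heteroatoms(candidate)
    if same_heteroatoms:
        # Each heteroatom of the query must be present, element for element.
        return all(c[element] >= count for element, count in q.items())
    # Otherwise only the number of heteroatoms is compared.
    return sum(c.values()) >= sum(q.values())


RINGS = {
    "furan": ("O", "C", "C", "C", "C"),
    "tetrahydrofuran": ("O", "C", "C", "C", "C"),  # same atoms, bonding ignored
    "thiophene": ("S", "C", "C", "C", "C"),
    "oxazole": ("O", "C", "N", "C", "C"),
    "cyclopentane": ("C", "C", "C", "C", "C"),
    "pyridine": ("N", "C", "C", "C", "C", "C"),
}

query = RINGS["furan"]
strict = [name for name, ring in RINGS.items() if ring_matches(query, ring)]
loose = [name for name, ring in RINGS.items()
         if ring_matches(query, ring, same_heteroatoms=False)]
print(strict)
print(loose)
```

Because bonding is not considered, furan and tetrahydrofuran are indistinguishable to this test, and thiophene is retrieved only when the heteroatom type is allowed to differ.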
In addition to these structural searches, there are a number of 'special properties' searches that often prove to be very useful as a means of reducing a large list of answers resulting from structure searches. The special properties searches include searches for a specific molecular weight or range of molecular weights and a search for compounds containing a given number of rings of a given size. Searches may also be conducted for the molecular formula corresponding to the query structure, or for other, user-defined molecular formulas. The formula may be specified completely or partially and the number of atoms of any element may be entered exactly or as a permissible range.
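A partial molecular formula with permissible ranges, and a molecular weight window, can be expressed as simple filters. The following Python sketch is an assumption-laden illustration: the constraint format (element mapped to a minimum and maximum count), the function names and the short atomic weight table are choices made for this example, not the CIS command syntax.

```python
import re

# Standard atomic weights for the few elements used in this example.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Cl": 35.45}


def parse_formula(formula):
    """Turn 'C6H4Cl2' into {'C': 6, 'H': 4, 'Cl': 2}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def formula_matches(formula, constraints):
    """constraints: {element: (minimum, maximum)}.
    Elements not listed are left unconstrained (a partial formula)."""
    counts = parse_formula(formula)
    return all(low <= counts.get(element, 0) <= high
               for element, (low, high) in constraints.items())


def molecular_weight(formula):
    return sum(ATOMIC_WEIGHT[element] * n
               for element, n in parse_formula(formula).items())


FORMULAS = ["C6H6", "C6H5Cl", "C6H4Cl2", "C6H3Cl3"]

# Exactly six carbons and one or two chlorines.
hits = [f for f in FORMULAS if formula_matches(f, {"C": (6, 6), "Cl": (1, 2)})]
# Molecular weight between 140 and 150.
in_window = [f for f in FORMULAS if 140 <= molecular_weight(f) <= 150]
print(hits)
print(in_window)
```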
If one's purpose is to determine only the presence or absence in a data base of a specific structure, this can be accomplished with the search option 'IDENT', as is shown in Figure 12. This program hash-encodes the query structure connection table and searches through a file of hash-encoded connection tables for an exact match. The search, which is very fast by substructure search standards, has been designed specifically for those users who, to comply with the Toxic Substances Control Act [26], have to determine the presence or absence of specific compounds in Environmental Protection Agency files.
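The hash-encoded lookup can be illustrated as follows. This Python sketch makes several assumptions that are not stated in the text: connection tables are taken to be already in a unique (canonical) atom numbering, SHA-256 from the standard library stands in for whatever hash function IDENT uses, and the names `encode`, `build_hash_file` and `ident` are hypothetical.

```python
import hashlib


def canonical_text(table):
    """table: (atoms, bonds) with atoms a tuple of element symbols in an
    assumed canonical order and bonds a list of (i, j, bond_order)."""
    atoms, bonds = table
    bond_text = ",".join(f"{i}-{j}:{order}" for i, j, order in sorted(bonds))
    return ",".join(atoms) + "|" + bond_text


def encode(table):
    """Hash-encode a connection table (SHA-256 is an assumed choice)."""
    return hashlib.sha256(canonical_text(table).encode("ascii")).hexdigest()


def build_hash_file(database):
    """Map each hash code to the keys of the structures that produce it."""
    hash_file = {}
    for key, table in database.items():
        hash_file.setdefault(encode(table), []).append(key)
    return hash_file


def ident(hash_file, database, query):
    """Exact-match search: one hash lookup, then a full comparison
    to guard against hash collisions."""
    wanted = canonical_text(query)
    return [key for key in hash_file.get(encode(query), [])
            if canonical_text(database[key]) == wanted]


DATABASE = {
    "ethanol": (("C", "C", "O"), [(1, 2, 1), (2, 3, 1)]),
    "dimethyl ether": (("C", "O", "C"), [(1, 2, 1), (2, 3, 1)]),
}
HASH_FILE = build_hash_file(DATABASE)

found = ident(HASH_FILE, DATABASE, (("C", "C", "O"), [(1, 2, 1), (2, 3, 1)]))
missing = ident(HASH_FILE, DATABASE, (("C", "O"), [(1, 2, 1)]))
print(found, missing)
```

The speed advantage over a substructure search comes from replacing graph comparison with a single keyed lookup; the price is that only complete, identical structures can be found.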
Finally, if one has completed ring probe and fragment probe searches for a specific query structure and is still confronted with a sizeable file of compounds that satisfy the criteria that were nominated, a substructure search through this file may be carried out. This involves an atom-by-atom, bond-by-bond comparison of the query structure with each structure in the file and will retrieve any compound in which the query structure is imbedded.
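An atom-by-atom, bond-by-bond comparison can be sketched as a backtracking search that tries to map every query atom onto a distinct atom of the candidate structure. The Python example below is a generic illustration of that technique, not the algorithm used in SANSS; the data layout and the helper names `structure` and `is_substructure` are assumptions.

```python
def structure(elements, bonds):
    """Build (atom_id -> element, frozenset({i, j}) -> bond order)."""
    return (dict(enumerate(elements, start=1)),
            {frozenset((i, j)): order for i, j, order in bonds})


def is_substructure(query, target):
    """True if every query atom and bond can be mapped onto the target."""
    q_elems, q_bonds = query
    t_elems, t_bonds = target
    q_atoms = list(q_elems)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return True                      # all query atoms placed
        qa = q_atoms[len(mapping)]           # next query atom to place
        for ta in t_elems:
            if ta in mapping.values() or t_elems[ta] != q_elems[qa]:
                continue
            # Every query bond to an already-placed atom must exist in
            # the target with the same bond order.
            consistent = True
            for qb, tb in mapping.items():
                order = q_bonds.get(frozenset((qa, qb)))
                if order is not None and t_bonds.get(frozenset((ta, tb))) != order:
                    consistent = False
                    break
            if consistent:
                mapping[qa] = ta
                if extend(mapping):
                    return True
                del mapping[qa]              # backtrack
        return False

    return extend({})


# Query: an acyl chloride group, C(=O)Cl (hydrogens suppressed).
QUERY = structure(["C", "O", "Cl"], [(1, 2, 2), (1, 3, 1)])
ACETYL_CHLORIDE = structure(["C", "C", "O", "Cl"],
                            [(1, 2, 1), (2, 3, 2), (2, 4, 1)])
ACETIC_ACID = structure(["C", "C", "O", "O"],
                        [(1, 2, 1), (2, 3, 2), (2, 4, 1)])

print(is_substructure(QUERY, ACETYL_CHLORIDE))
print(is_substructure(QUERY, ACETIC_ACID))
```

Since the cost of this comparison grows quickly with structure size, it makes sense to run it only on the reduced file left by the faster probes, as the text describes.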
The structure and nomenclature search system is the center of the CIS and operates on a unified data base of 40 files which are given in Figures 2a and 2b. The SANSS data bases are in the process of being updated with an additional 55 files, and a further 40-45 files are now being processed, including the list of some 20,000 chemicals covered by the Japanese Toxic Substances law. The 55-file update (which will bring the number of files in SANSS to 95) is scheduled for the spring of 1979. The whole system is available for general use on the ISC computer.
h. NMR Graphical Interactive Spectrum Analysis (GINA) Program
Many proton NMR spectra can be satisfactorily analyzed by hand, and such first order analysis is, in these cases, a quite satisfactory way of assigning chemical shifts and coupling constants to the various nuclei involved. In certain cases however, second order effects become important and as a result, more or fewer spectral lines than are indicated by first order considerations will result. A way to analyze such spectra is to estimate the various coupling constants and chemical shifts and then, using any of a variety of standard computer programs [11], calculate the theoretical spectrum corresponding to these values. The calculated spectrum can be compared to the observed spectrum and a new estimate of the data can be made. In this way, by a series of successive approximations, the correct coupling constants and chemical shifts can be determined.
The CIS component GINA (Graphical Interactive NMR Analysis), which is based upon the programs developed by Johannesen et al. [29], permits these operations in real time in an interactive fashion. The program is designed for use with a vector cathode ray tube terminal upon which each new theoretical spectrum can be displayed for comparison by the user with the observed spectrum. The program has been available at NIH for over four years and is currently being exported to a computer network in the private sector. The cost of using GINA will be $60 per hour.
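The calculate-and-compare step of such a successive approximation can be illustrated for the simple first order case. The Python sketch below is a deliberately limited example: it computes only first order multiplets for one nucleus coupled to spin-1/2 neighbors and does not attempt the second order calculation performed by the programs cited; the function names and the squared-difference misfit are assumptions made for illustration.

```python
from itertools import product


def first_order_lines(shift_hz, couplings_hz):
    """Return sorted (position in Hz, relative intensity) pairs for one
    nucleus with first order couplings to spin-1/2 neighbors."""
    lines = {}
    for signs in product((-0.5, 0.5), repeat=len(couplings_hz)):
        position = shift_hz + sum(s * j for s, j in zip(signs, couplings_hz))
        position = round(position, 6)        # merge coincident lines
        lines[position] = lines.get(position, 0) + 1
    return sorted(lines.items())


def misfit(calculated, observed_positions):
    """Sum of squared differences between calculated and observed
    line positions (both taken in ascending order)."""
    return sum((calc - obs) ** 2
               for (calc, _), obs in zip(calculated, sorted(observed_positions)))


observed = [93.5, 100.0, 106.5]              # invented line positions, Hz

trial_1 = first_order_lines(100.0, [7.0, 7.0])   # first estimate of J
trial_2 = first_order_lines(100.0, [6.5, 6.5])   # revised estimate
print(trial_1, misfit(trial_1, observed))
print(trial_2, misfit(trial_2, observed))
```

Two equal couplings of 7 Hz give the familiar 1:2:1 triplet; revising the estimate to 6.5 Hz brings the calculated lines into agreement with the invented observations, which is the refinement loop in miniature.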
i. Mathematical Modeling System (MLAB)
MLAB is a program set, developed by Knott at NIH [9], which can assimilate a file of experimental data, such as a titration curve, and perform on it any of a wide variety of mathematical operations. Included amongst these are differential and integral calculus and statistical analysis (mean and standard deviation, curve and distribution fitting, and linear and non-linear regression analysis). Output data can be presented in any form, but the PDP-10-resident program is especially powerful in the area of graphical output. Data can be displayed in the form of two- or three-dimensional plots which can be viewed and modified on a CRT terminal prior to pen-and-ink plotting. The cost of using MLAB on ISC is $60 per hour.
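Two of the operations listed above, summary statistics and linear regression, are shown here in plain Python to indicate the kind of calculation involved. This is not MLAB syntax and makes no claim about how MLAB is implemented; the function name `linear_fit` and the data points are invented for the example.

```python
import statistics


def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented data lying on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = linear_fit(xs, ys)
print(slope, intercept)
print(statistics.mean(ys), statistics.stdev(ys))
```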
j. Conformational Analysis of Molecules in Solution (CAMSEQ-2)
A problem of long standing in chemistry has been to estimate the relationship between the conformation of a molecule in the crystal, as measured by X-ray methods, and that in solution, where barriers to rotation are greatly reduced. A sophisticated program set for Conformational Analysis of Molecules in Solution by Empirical and Quantum-mechanical methods (CAMSEQ-2) has been developed for this purpose by Hopfinger and co-workers [12] at Case Western Reserve University.
This program can run interactively or in a batch mode. As input data, it requires the structure of the compound and this can be provided as a set of coordinate data from X-ray measurements. Alternatively, it can be entered interactively in the form of a connection table, or the program can simply be provided with a CAS registry number and, if the corresponding connection table is in the files of the CIS, it will use that.
The first task is to generate the coordinate data corresponding to a particular compound. Then the free energy of this conformation in solution is calculated. Next the program begins to change torsion angles specified by the user in the conformation and, with each new conformation, a statistical thermodynamic probability is calculated, based upon potential (steric, electrostatic, and torsional) functions and terms for the free energy associated with hydrogen-bonding, molecule-solvent and molecule-dipole interactions.
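The step in which a probability is attached to each conformation generated by a torsion-angle scan can be illustrated with Boltzmann weighting, p_i = exp(-E_i/RT) / sum_j exp(-E_j/RT). The Python sketch below is only a toy: the three-fold torsional potential and its barrier height are hypothetical stand-ins for the steric, electrostatic, torsional and solvent terms that CAMSEQ-2 evaluates, and the function names are assumptions.

```python
import math

R_KJ_PER_MOL_K = 0.0083145  # gas constant, kJ/(mol K)


def torsion_energy(phi_deg, barrier_kj=12.0):
    """Hypothetical three-fold torsional potential, in kJ/mol."""
    return 0.5 * barrier_kj * (1.0 + math.cos(math.radians(3.0 * phi_deg)))


def boltzmann_probabilities(energies, temperature_k=298.15):
    """Normalized Boltzmann weights for a list of energies in kJ/mol."""
    lowest = min(energies)   # subtract the minimum for numerical stability
    weights = [math.exp(-(e - lowest) / (R_KJ_PER_MOL_K * temperature_k))
               for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


angles = list(range(0, 360, 30))            # torsion angle scan, degrees
energies = [torsion_energy(a) for a in angles]
probs = boltzmann_probabilities(energies)

for angle, energy, p in zip(angles, energies, probs):
    print(f"{angle:3d}  {energy:6.2f} kJ/mol  p = {p:.4f}")
```

With this potential the staggered conformations at 60, 180 and 300 degrees carry nearly all of the probability, while the eclipsed conformations are populated only slightly.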
The cost of using CAMSEQ-2 is $60 per hour on the ISC computer, using either the CAS registry number, X-ray crystal data or a user generated connection table.
k. WaterDROP
Over the past few years, the improved sensitivity of analytical methods, particularly mass spectrometry, has permitted the accumulation of information about environmental pollutants. In the water systems of the United States, the US EPA has found many chemicals, of which over 1300 have been identified. Similar results have been obtained in Europe, under the guidance of the European Community (EC). The result of these activities has been the accumulation of considerable information about the identities of potentially toxic chemicals and the places where they have been found. The EPA research laboratory in Athens, Georgia, realizing the need for a centralized source for the collection, storage and dissemination of this information, has started to develop a Distribution Register of Organic Pollutants in Water (WaterDROP) [34]. The WaterDROP system, which was started in the summer of 1978, contains the identity of the chemical found, the sampling site and date, the reporting laboratory, the analytical method used and the date of entry into the system.
The WaterDROP data will be collected in a number of ways, and the main automatic collection of data for the system is envisioned to come from laboratories within EPA and elsewhere. A schematic of this is shown in Figure 13. In this diagram, MSSS users identify the unknown toxic pollutant using a Biemann type search. The Biemann search procedure has been modified so that EPA laboratories are required to enter additional information when conducting a search. As each laboratory identifies an unknown, the central computer builds up a data base of information for WaterDROP. The results of all these Biemann searches will be a centralized report file, such as that shown in Figure 14. In this figure one can see that, besides the usual MSSS results, the river, river mile, longitude, latitude, date and laboratory entering the data have been recorded.
With the anticipated international cooperation in building this data base, the WaterDROP file will grow quickly from these centralized reporting activities being used under the MSSS.
The data bank will be published by the EPA, as well as being searchable under SANSS and the WaterDROP software now being developed. Once the system is available, sometime in mid-1979, it will be possible to answer questions concerning locations in which specific pollutants are found and perhaps to recognize patterns of pollution which relate to plant effluent problems.
Answers to these and other similar questions, coupled with the toxicity data from RTECS and other CIS sources, should provide valuable technical facts to enable Governments to regulate and control pollutants more effectively.
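The items recorded for each WaterDROP entry, and the kind of location question mentioned above, can be pictured with a small record type and two queries. This Python sketch is an illustration under stated assumptions: the field names, the query functions and every sample record (rivers, coordinates, dates, laboratories) are invented and do not reflect the actual WaterDROP file layout or contents.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    cas_rn: str          # identity of the chemical found
    name: str
    river: str           # sampling site
    river_mile: float
    latitude: float
    longitude: float
    sampled: date        # sampling date
    laboratory: str      # reporting laboratory
    method: str          # analytical method used
    entered: date        # date of entry into the system


def sites_for(records, cas_rn):
    """Where has a given pollutant been found?"""
    return sorted({(r.river, r.river_mile) for r in records if r.cas_rn == cas_rn})


def pollutants_at(records, river):
    """Which pollutants have been reported for a given river?"""
    return sorted({r.name for r in records if r.river == river})


# Invented sample records, for illustration only.
RECORDS = [
    Observation("67-66-3", "chloroform", "River A", 12.5, 38.0, -85.0,
                date(1978, 7, 10), "Lab 1", "GC/MS", date(1978, 8, 1)),
    Observation("67-66-3", "chloroform", "River B", 3.0, 33.9, -83.4,
                date(1978, 9, 2), "Lab 2", "GC/MS", date(1978, 9, 20)),
    Observation("1912-24-9", "atrazine", "River A", 40.0, 38.2, -85.3,
                date(1978, 9, 15), "Lab 1", "GC/MS", date(1978, 10, 1)),
]

print(sites_for(RECORDS, "67-66-3"))
print(pollutants_at(RECORDS, "River A"))
```

Keying each record on the CAS registry number is what would allow such entries to be joined to toxicity data held in other CIS components.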
l. Aquatic Toxicity (AQUATOX)
Owing to the importance of fish in human nutrition, the danger posed to fish by chemicals is being recognized as a major concern of EPA and other US Government groups. A data bank of aquatic toxicity is being developed by EPA, in conjunction with ASTM Committee E-35.21.01. This data bank, expected to be available for testing on CIS in the latter part of 1979, will have information on the chemicals found in fish, the reported toxicities, the literature citations, common and scientific names of the species studied, the temperature, pH and hardness or salinity of the water in the study, and a comments section for other desired information related to the study.
m. Other CIS components.
There are a number of other data bases of valuable numeric information which are being built, obtained, or expanded from existing sources. These include infrared (IR) spectra, mutagenic/teratogenic studies, partition coefficients and thermodynamic data.
Anyone interested in the status of these and other such CIS projects is urged to write to either of the authors for a copy of the CIS status reports, informal progress reports issued jointly by EPA and NIH every six months.
SUMMARY
One of the first goals of the CIS was to produce a series of searchable chemical data bases for use by working analytical chemists with no especial computer expertise. A second aim was to link these data bases together so that the user need not be restricted to a consideration of, for example, only mass spectral data.
The various problems inherent in these plans included acquisition of data bases, design of programs, dissemination of the resulting system and linking, via CAS registration numbers, of the various CIS components. These problems, as has been described above, have been solved conceptually and, to a large extent, practically, and the CIS, as it now stands, is the result.
A review of the system, in an effort to define future goals, is under way, and a number of specific improvements are currently being made to the CIS software. For example, searches through more than one data base in combination would be very desirable. One often possesses mass spectral and NMR data for an unknown and it would be very useful to be able to identify any compounds that match these data in a single search. Work is going on in this area to interface programs so that this approach can be tested.
In another development, it is expected that the CONGEN programs developed for the DENDRAL project [30] will be merged into CIS within the next year. CONGEN, which generates structures corresponding to a specific empirical formula, could be extremely useful in a strategy for structure solving using the CIS. It is not at all difficult to envisage situations in which a reduced set of structures could be produced for consideration by CONGEN. Each structure in turn could be used as an input in the substructure search system and the various compounds whose registry numbers are so retrieved could be considered to be possible answers to the problem.
Confirmation for any of them could then be sought in the spectral data bases, the registry number being all that is necessary to locate and retrieve data. One can even speculate further to the day when synthetic pathways to any likely, but unavailable, candidates could be designed by the computer system, which could easily add the very practical touch of checking that any starting materials for such syntheses are commercially available at an appropriately low cost!
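Because every CIS component is keyed on the CAS registry number, a combined search across data bases reduces, in principle, to intersecting lists of registry numbers and then retrieving data by that key. The Python sketch below illustrates the idea only; the function name `combined_search`, the placeholder registry numbers and the hit lists are invented, and the actual interfacing work described above may take a different form.

```python
def combined_search(*hit_lists):
    """Return the registry numbers common to every hit list."""
    if not hit_lists:
        return set()
    result = set(hit_lists[0])
    for hits in hit_lists[1:]:
        result &= set(hits)
    return result


# Placeholder registry numbers standing in for the answers returned by
# separate searches of the mass spectral and NMR data bases.
mass_spectral_hits = ["RN-0001", "RN-0002", "RN-0003", "RN-0004"]
nmr_hits = ["RN-0003", "RN-0004", "RN-0005"]
substructure_hits = ["RN-0004", "RN-0006"]

two_way = combined_search(mass_spectral_hits, nmr_hits)
three_way = combined_search(mass_spectral_hits, nmr_hits, substructure_hits)
print(sorted(two_way))
print(sorted(three_way))
```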
In a different approach, the power of pattern recognition techniques could be assessed within some of the very large files contained in the CIS. This is a very useful exercise because there is little reported work of this sort on large files and thus we have begun to explore the value of such methods in handling the problem of identification of true unknowns such as water pollutants. Programs designed to test mass spectra for the presence in the compound of elements or groups, such as halogens and aromatic rings, are currently being written [31] and their utility as pre-filters on mass spectral data prior to data base searching will be tested as soon as is possible.
Progress to date with the CIS has demonstrated economic feasibility and scientific value in support of TSCA. The test before us is whether we can capitalize on this to explore the new and exciting possibilities that lie ahead in the area of using computer systems, such as the CIS, to support TSCA goals and assist in meeting the demands of the country for a safer and healthier environment.
LITERATURE CITED.
1. Heller, S. R., Milne, G. W. A., and Feldmann, R. J., Science, (1977), 195, 253.
2. Heller, S. R., Fales, H. M., and Milne, G. W. A., Org. Mass Spectrom., (1973), 7, 107; Heller, S. R., Koniver, D. A., Fales, H. M., and Milne, G. W. A., Anal. Chem., (1974), 46, 947; Heller, S. R., Feldmann, R. J., Fales, H. M., and Milne, G. W. A., J. Chem. Doc., (1973), 13, 130; Heller, S. R. and Milne, G. W. A., J. Chem. Info. Comp. Sci., (1976), 16, 176.
3. Dalrymple, D. L., Wilkins, C. L., Milne, G. W. A., and Heller, S. R., Org. Mag. Res., (1978), 11, 535.
4. Kennard, O., Watson, D. G., and Town, W. G., J. Chem. Doc., (1972), 12, 14.
5. McCarthy, G. and Johnson, G. G., paper C3 presented as a part of the Proceedings of the American Crystallographic Association meeting, State College, PA, 1974.
6. Marquart, R. G., Katsnelson, I., Milne, G. W. A., Heller, S. R., Johnson Jr., G. G., and Jenkins, R., unpublished results.
7. NIOSH, Registry of Toxic Effects of Chemical Substances, Volumes 1 & 2, DHEW (NIOSH) # 78-104-A (1977). GPO # 017-033-0027101, Government Printing Office, Washington, DC.
8. Unpublished EPA data. For further information one can contact either Charles Stephan, EPA, Duluth, MN 55804 or Steve Schimmel, EPA, Gulf Breeze, Sabine Island, FL 32561.
9. These include the bibliographic file associated with the data in reference 4 and the Mass Spectrometry Literature Bulletin, published by the Mass Spectrometry Data Centre, UKCIS, The University, Nottingham, England.
10. Knott, G. D., and Shrager, R. I., Assn. Comp. Machin., SIGGRAPH Notes 6, (1972), 138.
11. Heller, S. R. and Jacobson, A. E., Anal. Chem., (1972), 44, 2219.
12. Weintraub, H. J. and Hopfinger, A. J., Intnl. J. Quant. Chem., (1975), 9, 203; Potenzone, R., Cavicchi, E., Weintraub, H. J. R., and Hopfinger, A. J., Comp. and Chem., (1977), 1, 187.
13. Gelernter, H. L., Sanders, A. F., Larsen, D. L., Agarwal, K. K., Boivie, R. H., Spitzer, G. A., and Searleman, J. L., Science, (1977), 197, 1041.
14. Feldmann, R. J., Milne, G. W. A., Heller, S. R., Fein, A., Miller, J. A., and Koch, E., J. Chem. Inf. and Comp. Sci., (1977), 17, 157; Milne, G. W. A., Heller, S. R., Fein, A. E., Frees, E. F., Marquart, R. G., McGill, J. A., Miller, J. A., and Spies, D. S., J. Chem. Info. and Comp. Sci., (1978), in press, and references cited therein.
15. Heller, S. R., Milne, G. W. A., and Feldmann, R. J., J. Chem. Inf. and Comp. Sci., (1976), 16, 232.
16. EPA Order #2800.2, issued May 27, 1975.
17. The NIH/EPA/MSDC data base is available for lease from the US National Bureau of Standards, Office of Standard Reference Data, A537 Administration Building, Washington, DC 20234. [Telephone 301-921-2467].
18. Speck, D. D., Venkataraghavan, R., and McLafferty, F. W., Org. Mass Spectrom., (1978), 13, 208.
19. Clerc, T., Schwarzenbach, R., Meili, J., and Koenitzer, H., Org. Magn. Reson., (1976), 8, 11.
20. These data are available as NBS tape #9. Contact the National Technical Information Service (NTIS), Springfield, VA 22151 for details.
21. Hanawalt, J.D., Rinn, H. W., and Frevel, L.K., Ind. Eng. Chem., (1938), 10, 457.
22. Environmental Protection Agency (EPA), Toxic Substances Control Act (TSCA) Inventory Reporting Requirements, Federal Register, 42, 247, Friday December 23, 1977, pages 64572-64596. In particular, see section 710.7 on pages 64579-64580.
23. Abramson, F. P., Anal. Chem., (1975), 47, 45.
24. PL 91-596, Occupational Safety and Health Act of 1970 (OSHA), section 20 (a).
25. O'Korn, L. J., Chapter 6 in "Algorithms for Chemical Computations", ed. by R. E. Christoffersen, ACS Symposium Series #46, (1977).
26. PL-94-469, Toxic Substances Control Act of 1976 (TSCA).
27. Vinton, V. A., Milne, G. W. A., and Heller, S. R., Anal. Chim. Acta, (1977), 95, 41.
28. Hartmann, K., Lias, S., Ausloos, P. J., and Rosenstock, H. M., Publication NBSIR 76-1061, July 1976.
29. Johannesen, R. E., Ferretti, J. A., and Harris, R. K., J. Magn. Res., (1970), 3, 84.
30. Carhart, R. E., Smith, D. H., Brown, H., and Djerassi, C., J. Amer. Chem. Soc., (1975), 97, 5755.
31. Meisel, W., Jolley, H. and Heller, S. R., in preparation.
32. Details on the availability of the CNMR data base can be obtained from: Dr. Charles L. Citroen, Netherlands Information Combine, CID-TNO, PO Box 36, 2600 AA, Delft, The Netherlands.
33. Telenet Communications Corporation, 8330 Old Courthouse Road, Vienna, VA 22180 (703-627-9200).
34. Further information on the WaterDROP system can be obtained from: Ms. Ann Alford, EPA, ERL, College Station Road, Athens, Georgia, 30601.
FIGURE CAPTIONS.
Figure 1. Search for aromatic chloro, bromo compounds in the CIS Unified Data Base.
Figures 2a,b. List of the current 40 collections which comprise the CIS Unified Data Base.
Figure 3. The CIS components, their status, and how they are linked together.
Figure 4. Protocol for adding a component to the CIS.
Figure 5. PEAK search in the MSSS.
Figure 6. Space group and molecular weight search in the Cambridge crystal data base.
Figure 7. Search for acute toxicity data.
Figure 8. Schematic Representation of SANSS.
Figure 9. Name Search (NPROBE) for Dioxin.
Figure 10. SANSS Fragment probe search.
Figure 11. SANSS Ring/Nucleus probe search.
Figure 12. Complete structure (IDENT) search.
Figure 13. Schematic of automated data collection activities for WaterDROP.
Figure 14. Sample entries for the WaterDROP system from a modified MSSS Biemann search.