_______________________

CHEMCORNER

by Guest Columnist

-----------------------------------

Stephen R. Heller

EPA, PM-218

Washington, DC 20460

and George W. A. Milne

NHLBI, NIH

Bethesda, MD 20205

INTRODUCTION

Most of the articles appearing in this journal have related to bibliographic databases. The NIH/EPA Chemical Information System (CIS) contains primarily numeric and chemical structure data, and thus is complementary to such services as DIALOG, ORBIT, TOXLINE and BRS. In this article we hope to convey the unique features of the CIS and to show how they may be of use to the scientific community.

The CIS consists of a collection of chemical databases together with a battery of computer programs for interactive searching through these disk-stored databases. In addition, the CIS has a data referral capability as well as a data analysis software system. It can be thought of, then, as having four main areas:

1. Numerical Databases

2. Data Analysis Software

3. Structure and Nomenclature Search System

4. Database Referral

The numeric databases that are part of the CIS include files of mass spectra, carbon-13 nuclear magnetic resonance, X-ray diffraction data for single crystals and powders, acute toxicity data, and aquatic toxicity data. There are bibliographic databases associated directly with the mass spectrometry and X-ray crystallography files. The analytical programs include a family of statistical analysis and mathematical modelling algorithms, and programs for the second-order analysis of NMR spectra and energy minimization of chemical conformations. Programs that design chemical syntheses are being tested and may, if viable, become part of the CIS in the future. In addition, there are a number of other CIS components either under discussion or development. These include a thermodynamics data bank, an infrared data bank, an aquatic toxicity data bank, an Ames test data bank, a computer-searchable version of the Clinical Toxicology of Commercial Products (CTCP), a public TSCA plant and production data bank, a distribution register of organic chemicals found in water (WaterDROP), a database of computer-generated partition coefficient data, and lastly, the Stanford University CONGEN program, which generates all possible chemical structures for a given molecular formula.

SEVERAL COMPONENTS OF CIS

The current CIS includes the following components:

Structure and Nomenclature Search System (SANSS) provides a database of over 140,000 compounds collected from 41 different sources. All of the CIS databases are searchable through SANSS by full, partial, left- or right-truncated name, structure, chemical functional group code, molecular formula or molecular weight. It should be noted that the left-truncated name capability of SANSS applies only to subsets of the database; the entire database is searchable by right truncation. The data obtained from SANSS include the CAS REGN (Chemical Abstracts Service Registry Number), other databases containing information on the substances, molecular formula, structural diagram, and systematic names as well as synonyms and trade names. Lastly, the structures found by a search can be drawn out on either a teletype or graphics terminal. SANSS is unique among publicly available systems in this structure drawing ability.
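The difference between right- and left-truncated name searching can be sketched in a few lines of Python. This is an illustration only, not the SANSS implementation; the function names and the sample list of names are assumptions made for the example.

```python
def right_truncated(names, stem):
    """Return names that start with `stem` (query truncated on the right)."""
    stem = stem.lower()
    return [name for name in names if name.lower().startswith(stem)]


def left_truncated(names, ending):
    """Return names that end with `ending` (query truncated on the left)."""
    ending = ending.lower()
    return [name for name in names if name.lower().endswith(ending)]


# Hypothetical sample of chemical names, for illustration only.
names = ["chlorobenzene", "chloroform", "nitrobenzene", "benzene"]

print(right_truncated(names, "chlor"))    # names beginning with "chlor"
print(left_truncated(names, "benzene"))   # names ending with "benzene"
```

A right-truncated query can be answered from a file sorted by name, whereas a left-truncated query cannot, which may be one reason such a capability would be offered only on subsets of a large file.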

Mass Spectral Search System (MSSS) contains electron-impact mass spectra of over 33,000 compounds, which can be searched on the basis of normal peak and intensity, as well as by the Biemann and probability-based matching (PBM) techniques. Mass spectral bibliographic information, from the Mass Spectrometry Data Centre (MSDC) in England, is also available. A smaller file of over 500 chemical ionization spectra is also searchable within the system.
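A peak and intensity search of the kind mentioned above can be pictured as follows. This Python sketch is only an assumption about the general idea (it is neither the MSSS program nor the Biemann or PBM method): each spectrum is taken to be a table of m/z values with relative intensities from 0 to 100, and the two library spectra are invented.

```python
def spectrum_matches(spectrum, required_peaks):
    """True if every required m/z has an intensity within its allowed range.

    `spectrum` maps m/z -> relative intensity (0-100).
    `required_peaks` is a list of (mz, min_intensity, max_intensity) tuples.
    A peak absent from the spectrum is treated as intensity 0.
    """
    return all(low <= spectrum.get(mz, 0) <= high
               for mz, low, high in required_peaks)


def peak_search(library, required_peaks):
    """Return the identifiers of all library spectra that satisfy the query."""
    return [ident for ident, spectrum in library.items()
            if spectrum_matches(spectrum, required_peaks)]


# Invented two-entry library, for illustration only.
library = {
    "compound A": {43: 100, 58: 40},
    "compound B": {43: 20, 91: 100},
}

# Ask for spectra with a peak at m/z 43 of relative intensity 50 to 100.
print(peak_search(library, [(43, 50, 100)]))
```

Adding further peaks to the query narrows the list of candidates, which is how an interactive search of this type would normally proceed.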

EPA's Oil and Hazardous Material-Technical Assistance Data System (OHM-TADS) provides information pertinent to emergency spill response efforts. The OHM-TADS database includes a wide variety of physical, chemical, biological, toxicological and commercial data on these materials, with emphasis placed on their deleterious effects on water quality. Up to 126 different fields of information are maintained for more than 1,000 materials.

Registry of Toxic Effects of Chemical Substances (RTECS) is provided by the National Institute for Occupational Safety and Health to the CIS. The RTECS publication, with over 40,000 toxicological measurements, is available as a CIS component, with a set of programs to search the file on the basis of toxicological data and designators (e.g., animal type and LD50).

Powder Diffraction Search-Match (PDSM) contains over 33,000 powder diffraction patterns, provided by the JCPDS - International Centre for Diffraction Data, for identifying compounds based upon the characteristics of their powder diffraction patterns, as opposed to X-ray single-crystal identification.

Carbon-13 Nuclear Magnetic Resonance Spectral Search System (CNMR) contains CNMR spectra of over 8,500 compounds. Searches by chemical shift requirements are permitted; analysis and display of this information for compounds of interest may also be obtained.

X-ray Crystallographic Search System (CRYST) contains the bibliographic and structural files of the Crystallographic Data Centre (Cambridge, England). Atomic coordinates and cell parameters for about 15,000 compounds are contained in CRYST.

The CRYST information is available for searching on either a structural or bibliographic basis. (Use of the CRYST system outside of the USA requires the written permission of the Cambridge Data Centre and may require an additional payment to the Centre.)

Mathematical Modeling System (MLAB) is an interactive system for mathematical modeling. This component provides major capabilities in the areas of curve and distribution fitting, linear and non-linear regression, statistical analysis, differential and integral calculus, and two- and three-dimensional plotting.

The Conformational Analysis of Molecules in Solution by Empirical and Quantum-Mechanical Techniques System (CAMSEQ-II) provides the capability for calculating the conformation of a molecule in solution and estimating many properties of a molecule in solution, given either its CAS REGN from SANSS, or a two-dimensional structural representation, such as provided by SANSS and CRYST.

The following two CIS components are expected to be available in the second half of 1979:

X-ray Single Crystal Search System (X-RAY) provides a search of the Crystal Data Determinative Tables published by the National Bureau of Standards and the JCPDS. The space group, density, unit cells and chemical types are searchable, with the determinative tables being printed out after a search is performed.

Federal Register Notices (FR) provides a cross reference to all of the citations of a chemical or class of chemicals cited in the Federal Register since January 1, 1978. The title, part, subpart, and a short description of the notice for the chemical of interest are available for printout.

The above CIS components are available only in the private sector for use by the government, industry, universities and the public through an annual subscription fee of $300 and an hourly connect charge, as shown below. The CIS uses the Telenet network, and thus is available throughout the US, Canada, most of Western Europe, Israel, Australia, Japan, Hong Kong and Singapore via a local phone call. In the US and some other countries both 300 and 1200 baud service are available using an available terminal.

The hourly connect rate is either $36 or $60 in the US and Canada. In countries with PTTs, users are given a $12 per hour credit, and their communication costs are billed directly to them by the PTT. The hourly rates include all the computer processing units and connect minutes accumulated in a program.

$36/Hour Components: MSSS, OHM-TADS, RTECS, CNMR, CRYST and X-RAY.

$60/Hour Components: SANSS, PDSM, FR, CAMSEQ-II, MLAB.
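As a worked illustration of these charges, the arithmetic can be written out in Python. The rates, the $12 credit and the $300 subscription are taken from the text; the function name and the assumption that the credit is simply subtracted from the hourly rate are ours, and PTT communication costs, which are billed separately, are not included.

```python
ANNUAL_SUBSCRIPTION = 300  # dollars per year, from the text
PTT_CREDIT = 12            # dollars per hour, credited in countries with PTTs

# Hourly connect rates in dollars, from the two component lists in the text.
HOURLY_RATE = {}
for component in ("MSSS", "OHM-TADS", "RTECS", "CNMR", "CRYST", "X-RAY"):
    HOURLY_RATE[component] = 36
for component in ("SANSS", "PDSM", "FR", "CAMSEQ-II", "MLAB"):
    HOURLY_RATE[component] = 60


def connect_charge(component, hours, ptt_country=False):
    """Connect charge in dollars for one session (hypothetical helper).

    Assumes the PTT credit is applied as a reduction of the hourly rate.
    """
    rate = HOURLY_RATE[component]
    if ptt_country:
        rate -= PTT_CREDIT
    return rate * hours


print(connect_charge("SANSS", 2))                        # two hours in the US
print(connect_charge("MSSS", 1.5, ptt_country=True))     # 1.5 hours via a PTT
```

Under these assumptions, two hours of SANSS from the US come to $120, and an hour and a half of MSSS from a PTT country comes to $36 before the PTT's own charges.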

SANSS - THE CENTRAL PART OF THE CIS

There have been a number of articles published about the CIS and its components, including the recent article in Database (March 1979, pages 35 ff.). (As an aside, it should be mentioned that the contract with Brookhaven National Laboratory has ended and a new CIS operations contractor, ISC, has been hired - see reference 1.) In this article we will not repeat those details, but rather refer the reader to these publications, such as the CIS overview published in 1977 in Science. The remainder of this paper will deal with the central part of the CIS, namely SANSS, its list referral capability, and its unique linking to other systems of data and information.

As Buntrock described in a recent Chemcorner article (March 1979), there are a number of sources of CAS Registry Number data. Each of these sources was established "independently" for different purposes. Also, these sources are searchable in a number of different ways. What we hope to explain here are the differences between these sources, and how they, for the most part, actually complement each other.

DIFFERENT REGISTRY NUMBER SOURCES

The CHEMLINE file, the first of its kind, was established at the beginning of this decade. CHEMLINE consists primarily of the chemicals found in the CAS-CBAC publication, as well as a number of other bibliographic files available in the TOXLINE system at NLM. It contains over 350,000 different chemicals. The file can be searched by name, name fragment, right truncation and a number of textual CAS chemical ring data keys. There is no structure output in CHEMLINE.

CHEMNAME

The CHEMNAME file on Lockheed's DIALOG system is based on all the chemicals referenced two or more times in the CA 9th Collective Index (1972-1976). It can be searched essentially the same way the CHEMLINE file is searched. There is no structure output in CHEMNAME.

CHEMDEX

The CHEMDEX system on SDC's ORBIT system is based on all the chemicals referred to (any number of times) in the CA files since 1972. The file now only covers 1972 and 1973 and contains about 500,000 entries; however, SDC has indicated the file is expected to contain in excess of 2 million compounds. The file is searchable in the same way as the previous two systems and, in addition, is searchable with left truncation capability. There is no structure output in CHEMDEX.

SANSS - MUCH DIFFERENT FILE CONSTRUCTION

The CIS/SANSS system is based on a very different concept of file construction. SANSS is based on collecting and integrating lists of chemicals from "relevant" sources, primarily government files and files of numeric data. For example, at present SANSS has 41 lists of chemicals integrated into one master Unified Database of some 140,000 chemicals. Among the lists which comprise the 41 are the TSCA inventory, Merck Index, NIOSH RTECS, Mass Spectrometry, Oak Ridge Mutagen and Teratogen files, Carbon-13 NMR, X-ray data files (3), EPA Pesticides, NCI carcinogens, OSHA carcinogens, and CHEMLINE chemicals already in SANSS.
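The idea of integrating separate lists into one master file keyed by registry number can be sketched as follows. This is a simplified Python illustration, not a description of how the Unified Database is actually built; the list names come from the text, while the registry numbers ("RN-1" and so on) and the list memberships are placeholders invented for the example.

```python
from collections import defaultdict


def build_unified(lists):
    """Merge source lists into one table: registry number -> set of list names."""
    unified = defaultdict(set)
    for list_name, registry_numbers in lists.items():
        for regn in registry_numbers:
            unified[regn].add(list_name)
    return dict(unified)


# Placeholder registry numbers and memberships, for illustration only.
lists = {
    "TSCA inventory": ["RN-1", "RN-2"],
    "Merck Index": ["RN-2", "RN-3"],
    "NIOSH RTECS": ["RN-2"],
}

unified = build_unified(lists)

print(len(unified))            # number of distinct chemicals in the master file
print(sorted(unified["RN-2"]))  # every source list that contains RN-2
```

Each chemical appears once in the master table, however many source lists mention it, and the set of list names attached to it is what makes referral from one file to another possible.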

FIRST REGISTRY FOR OVER 35,000 CHEMICALS

As a result of using lists of chemicals, rather than specific bibliographic sources, a very interesting file was created. The primary unique feature that comes from this approach is that, of the 140,000 chemicals, over 35,000 were given CAS Registry Numbers for the first time. This means that there were no bibliographic literature citations in the CAS system at the time these chemicals were first registered for CIS. The reason for this is three-fold. Firstly, the CAS Registry started in 1965 and some of the chemicals in SANSS files are pre-1965. Secondly, some of the chemicals are either not in CAS-abstracted literature or are in government publications. Lastly, some of the chemicals in SANSS, namely those on the TSCA Inventory, were defined and registered for the government in support of a regulatory statute, and hence have never before been described in this particular manner in the literature.

While the chemicals that have been assigned CAS Registry numbers for the first time come from most SANSS files, some have been major contributors. These include, with their corresponding approximate number of new Registry numbers:

1. NIOSH RTECS (1400)

2. Mass Spec (7800)

3. NBS Single Crystal (4000)

4. Tokyo Thermodynamics (1000)

5. US International Trade Commission (1500)

6. PHS-149 List of Chemicals tested for carcinogenicity (700)

7. NIMH Psychotropic Drugs (300)

8. EMIC (300)

Of all the above cases of new registration, the most interesting is probably the ITC list of chemicals in import/export commerce. The total lack of information on these chemicals in the CAS literature from 1965 to the date of registration (1977) could be used to prioritize them for further testing and studies. As it turns out, most of the chemicals are from the Colour Index or are surfactants.

In order to do a search in SANSS, one must first decide on how to set up the query. SANSS has the distinction, as well as the problem, of having the most diverse set of search techniques available in any system of its kind. Searches can be performed in a number of ways, including:

1. Name search (NPROBE)

2. Structure Key search (PROBE)

3. Ring search (RPROBE)

4. Fragment search (FPROBE)

NAME SEARCH APPROACH MOST COMMON

As one can readily guess, the most used technique is the NPROBE, or name search, approach. A CIDS structure key search requires knowledge of the key codes, for which there is a good, but lengthy, manual. A ring or fragment search requires setting up a structure diagram using a simple set of commands that "draws" the chemical structure at your terminal. While the SANSS flexibility helps both the inexperienced and the experienced user, it does take about 1 hour to get accustomed to the system. Once a search is performed by any of the above methods (or by combinations of them), the results (actually the CAS Registry numbers) are stored in a file, the same way as is done on the ELHILL, ORBIT and DIALOG systems. By doing this we are able either to combine this list with another search list, or to use the list to look up data from other files or computers, using the CAS REGN as the link between data and information systems.
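The two uses of a stored result list, combining it with another list and using it as a link to another file, can be illustrated in Python with ordinary set operations. This is a sketch of the concept only; the registry numbers are placeholders, and the "other file" is an invented table standing in for any data file keyed by CAS REGN.

```python
# Stored result lists from two hypothetical searches (placeholder numbers).
name_hits = {"RN-1", "RN-2", "RN-3"}   # e.g. from a name search
ring_hits = {"RN-2", "RN-3", "RN-4"}   # e.g. from a ring search

in_both = name_hits & ring_hits        # chemicals found by both searches
in_either = name_hits | ring_hits      # chemicals found by at least one
name_only = name_hits - ring_hits      # found by name but not by ring

# An invented data file keyed by registry number.
other_file = {"RN-2": "record for RN-2", "RN-9": "record for RN-9"}

# Use the combined list to pull the matching records from the other file.
linked = {regn: other_file[regn] for regn in sorted(in_both) if regn in other_file}

print(sorted(in_both))
print(linked)
```

Because every file is keyed by the same registry number, no name matching is needed at the linking step, which is the point of using the CAS REGN as the common identifier.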

NOT ALL CHEMICALS IN FILE HAVE COMPLETELY DEFINED STRUCTURES

It is important to note that not all the chemicals in SANSS (or for that matter in the CAS Registry system of over 4.5 million) have completely defined structures. Such "unstructurable" chemicals, while well defined in a chemical as well as a legal sense, can be searched, but only in the name search part of SANSS. Such chemicals include substances of unknown or variable composition, complex reaction products and biological materials, the so-called UVCB chemicals. Some of the definitions of these chemicals run into paragraphs, but each of the words (i.e., keywords) in the definition is fragmented when the file is processed and is thus searchable as a name fragment. Polymers are also handled this same way, using name searching only (i.e., the SANSS NPROBE option). One last point of importance is that, because of the non-structurable chemicals in SANSS, there are more chemicals that can be searched by name than by structure: 100% are name searchable, while only some 90% are structure searchable.
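The way a long definition becomes searchable by its individual words can be sketched with a small inverted index in Python. This is our illustration of the general technique, not the SANSS file processing itself; the function names, the placeholder registry numbers and the two definitions are all invented for the example.

```python
import re
from collections import defaultdict


def build_fragment_index(definitions):
    """Map each word of each definition to the registry numbers containing it."""
    index = defaultdict(set)
    for regn, text in definitions.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(regn)
    return index


def fragment_search(index, *fragments):
    """Return registry numbers whose definitions contain every given word."""
    hits = [index.get(fragment.lower(), set()) for fragment in fragments]
    return set.intersection(*hits) if hits else set()


# Invented definitions with placeholder registry numbers, for illustration only.
definitions = {
    "RN-1": "Distillates, petroleum, complex reaction products",
    "RN-2": "Reaction products of tall oil with ethylene oxide",
}

index = build_fragment_index(definitions)

print(sorted(fragment_search(index, "reaction", "products")))
print(sorted(fragment_search(index, "petroleum")))
```

A substance with no structure at all can thus still be retrieved by any word of its definition, which is why name searching reaches the whole file while structure searching does not.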

AN EXAMPLE OF IDENTIFYING AN UNKNOWN

In the example below, the purpose is to show how the CIS can be used to identify an unknown, given that a liquid chemical has been found which has a bitter taste. The first thing done was to obtain a mass spectrum of the unknown, and then go to the CIS. As the example shows, the mass spectrum alone does not allow for the identification, and after obtaining some toxicity data (via RTECS) and the structures of the two mass spec hits (via SANSS), the OHM-TADS system is finally used to distinguish between the two possible answers. Lastly, the CIS-DIALOG link, using the CAS REGN, is used to get some literature references to the unknown.

(In the example shown below, the underlined information is that which the user enters; all other information is printed out by the computer.)