by Guest Columnist
Stephen R. Heller
Washington, DC 20460
and George W. A. Milne
Bethesda, MD 20205
Most of the articles appearing in this journal have related to bibliographic databases. The NIH/EPA Chemical Information System (CIS) contains primarily numeric and chemical structure data, and thus is complementary to such services as DIALOG, ORBIT, TOXLINE and BRS. In this article we hope to convey the unique features of the CIS and to show how they may be of use to the scientific community.
The CIS consists of a collection of chemical databases together with a battery of computer programs for interactively searching through these disk-stored databases. In addition, the CIS has a data referral capability as well as a data analysis software system. It can be thought of, then, as having four main areas:
1. Numerical Databases
2. Data Analysis Software
3. Structure and Nomenclature Search System
4. Database Referral
The numeric databases that are part of the CIS include files of mass spectra, carbon-13 nuclear magnetic resonance, X-ray diffraction data for single crystals and powders, acute toxicity data, and aquatic toxicity data. There are bibliographic databases associated directly with the mass spectrometry and X-ray crystallography files. The analytical programs include a family of statistical analysis and mathematical modelling algorithms and programs for the second-order analysis of NMR spectra and energy minimization of chemical conformations. Programs that design chemical syntheses are being tested and may, if viable, become part of the CIS in the future. In addition, there are a number of other CIS components either under discussion or development. These include a thermodynamics data bank, an infrared data bank, an aquatic toxicity data bank, an Ames test data bank, a computer-searchable version of the Clinical Toxicology of Commercial Products (CTCP), a public TSCA plant and public production data bank, a distribution register of organic chemicals found in water (WaterDROP), a database of computer-generated partition coefficient data, and lastly, the Stanford University CONGEN program, which generates all possible chemical structures for a given molecular formula.
SEVERAL COMPONENTS OF CIS
The current CIS includes the following components:
Structure and Nomenclature Search System (SANSS) provides a database of over 140,000 compounds collected from over 41 different sources. All of the CIS databases are searchable through SANSS by full, partial, left- or right-truncated name, structure, chemical functional group code, molecular formula or molecular weight. It should be noted that the left-truncated name capability of SANSS applies only to subsets of the database, not the entire database, which is searchable by right-truncation. The data obtained from SANSS includes the CAS REGN (Chemical Abstracts Service Registry Number), other databases containing information on the substances, molecular formula, structural diagram, and systematic names as well as synonyms and trade names. Lastly, the results of a search can be structurally drawn out on either a teletype or graphics terminal. SANSS is unique among publicly available systems in this structure drawing ability.
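The difference between the name-search options can be illustrated with a minimal sketch. The five-name list and the three function names below are assumptions made purely for illustration; they are not SANSS data or SANSS commands. Right truncation is taken here to mean matching names that begin with the query stem, left truncation to mean matching names that end with it, and a partial-name search to mean matching the fragment anywhere in the name.

```python
# Illustrative only: a tiny list standing in for a chemical name file.
NAMES = ["benzene", "chlorobenzene", "benzaldehyde", "nitrobenzene", "toluene"]


def right_truncated(names, stem):
    """Query cut off on the right: match names that START with the stem."""
    return [n for n in names if n.startswith(stem)]


def left_truncated(names, ending):
    """Query cut off on the left: match names that END with the given text."""
    return [n for n in names if n.endswith(ending)]


def partial(names, fragment):
    """Partial-name search: match the fragment anywhere in the name."""
    return [n for n in names if fragment in n]


print(right_truncated(NAMES, "benz"))
print(left_truncated(NAMES, "benzene"))
print(partial(NAMES, "benz"))
```

In this sketch a right-truncated query for "benz" misses chlorobenzene and nitrobenzene, which is why left truncation or a partial-name search is useful when the fragment of interest sits at the end of a name.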
Mass Spectral Search System (MSSS) contains electron-impact mass spectra of over 33,000 compounds, which can be searched on the basis of normal peak and intensity, as well as by the Biemann and probability-based matching (PBM) techniques. Mass spectral bibliographic information, from the Mass Spectrometry Data Centre (MSDC) in England, is also available. A smaller file of over 500 chemical ionization spectra is also searchable within the system.
EPA's Oil and Hazardous Material-Technical Assistance Data System (OHM-TADS) provides information pertinent to emergency spill response efforts. The OHM-TADS database includes a wide variety of physical, chemical, biological, toxicological and commercial data on these materials, with emphasis placed on their deleterious effects on water quality. Up to 126 different fields of information are maintained for more than 1,000 materials.
Registry of Toxic Effects of Chemical Substances (RTECS) is provided by the National Institute for Occupational Safety and Health to the CIS. The RTECS publication, with over 40,000 toxicological measurements, is available as a CIS component, with a set of programs to search the file on the basis of toxicological data and designators (e.g., animal type and LD50).
Powder Diffraction Search-Match (PDSM) contains over 33,000 powder diffraction patterns, provided by the JCPDS - International Centre for Diffraction Data, for identifying compounds based upon the characteristics of their powder diffraction patterns, as opposed to X-ray single crystal identification.
Carbon-13 Nuclear Magnetic Resonance Spectral Search System (CNMR) contains CNMR spectra of over 8,500 compounds. Searches by chemical shift requirements are permitted; analysis and display of this information for compounds of interest may also be obtained.
X-ray Crystallographic Search System (CRYST) contains the bibliographic and structural files of the Crystallographic Data Centre (Cambridge, England). Atomic coordinates and cell parameters for about 15,000 compounds are contained in CRYST.
The CRYST information is available for searching on either a structural or bibliographic basis. (Use of the CRYST system outside of the USA requires the written permission of the Cambridge Data Centre and may require an additional payment to the Centre.)
Mathematical Modeling System (MLAB) is an interactive system for mathematical modeling. This component provides major capabilities in the areas of curve and distribution fitting, linear and non-linear regression, statistical analysis, differential and integral calculus, and two- and three-dimensional plotting.
The Conformational Analysis of Molecules in Solution by Empirical and Quantum-Mechanical Techniques System (CAMSEQ-II) provides the capability for calculating the conformation of a molecule in solution and estimating many properties of a molecule in solution, given either its CAS REGN from SANSS, or a two-dimensional structural representation, such as provided by SANSS and CRYST.
The following two CIS components are expected to be available in the second half of 1979:
X-ray Single Crystal Search System (X-RAY) provides a search of the Crystal Data Determinative Tables published by the National Bureau of Standards and the JCPDS. The space group, density, unit cells and chemical types are searchable, with the determinative tables being printed out after a search is performed.
Federal Registry Notices (FR) provides a cross reference to all of the citations of a chemical or class of chemicals cited in the Federal Register since January 1, 1978. The title, part, subpart, and a short description of the notice for the chemical of interest are available for printout.
The above CIS components are available only in the private sector for use by the government, industry, universities and the public through an annual subscription fee of $300 and an hourly connect charge, as shown below. The CIS uses the Telenet network, and thus is available throughout the US, Canada, most of Western Europe, Israel, Australia, Japan, Hong Kong and Singapore via a local phone call. In the US and some other countries both 300 and 1200 baud service are available using an available terminal.
The hourly connect rate is either $36 or $60 in the US and Canada. In countries with PTTs, users are given a $12 per hour credit, and their communication costs are billed directly to them by the PTT. The hourly rates include all the computer processing units and connect minutes accumulated in a program.
$36/Hour Components: MSSS, OHM-TADS, RTECS, CNMR, CRYST and X-RAY.
$60/Hour Components: SANSS, PDSM, FR, CAMSEQ-II, MLAB.
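The fee structure above can be worked through with a short sketch. The dollar figures and the grouping of components come from the text; the function name `connect_charge`, the `ptt_country` flag and the assumption that connect time is charged pro rata by the minute are illustrative only and are not taken from any CIS billing documentation.

```python
ANNUAL_FEE = 300  # dollars per year, as stated in the text
PTT_CREDIT = 12   # dollars per hour credited to users in countries with PTTs

# Hourly connect rates in dollars, grouped as listed in the text.
HOURLY_RATE = {
    "MSSS": 36, "OHM-TADS": 36, "RTECS": 36, "CNMR": 36, "CRYST": 36, "X-RAY": 36,
    "SANSS": 60, "PDSM": 60, "FR": 60, "CAMSEQ-II": 60, "MLAB": 60,
}


def connect_charge(component, minutes, ptt_country=False):
    """Connect charge in dollars for one session.

    Assumption: time is billed pro rata by the minute.
    """
    rate = HOURLY_RATE[component]
    if ptt_country:
        rate -= PTT_CREDIT  # communication costs are then billed by the PTT
    return rate * minutes / 60


print(connect_charge("SANSS", 30))                    # half an hour at $60/hour
print(connect_charge("MSSS", 60, ptt_country=True))   # one hour at $36 less the $12 credit
```

Under these assumptions, a half-hour SANSS session costs $30 in connect charges, in addition to the $300 annual subscription.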
SANSS - THE CENTRAL PART OF
There have been a number of articles published about the CIS and its components, including the recent article in Database (March 1979, pages 35 ff.). (As an aside, it should be mentioned that the contract with Brookhaven National Laboratory has ended and a new CIS operations contractor, ISC, has been hired - see reference 1.) In this article we will not repeat any details, but rather refer to these publications, such as the CIS overview published in 1977 in Science. The remainder of this paper will deal with the central part of the CIS, namely SANSS, its list referral capability, and its unique linking to other systems of data and information.
As Buntrock described in a recent Chem-corner article (March 1979), there are a number of sources of CAS Registry Number data. Each of these sources was established "independently" for different purposes. Also, these sources are searchable in a number of different ways. What we hope to explain here are the differences between these sources, and how they, for the most part, actually complement each other.
DIFFERENT REGISTRY NUMBER SOURCES
The CHEMLINE file, the first of its kind, was established at the beginning of this decade. CHEMLINE consists primarily of the chemicals found in the CAS-CBAC publication, as well as a number of other bibliographic files available in the TOXLINE system at NLM. It contains over 350,000 different chemicals. The file can be searched by name, name fragment, right truncation and a number of textual CAS chemical ring data keys. There is no structure output in CHEMLINE.
The CHEMNAME file on Lockheed's
DIALOG system is based on all the
chemicals referenced two or more times in
the CA 9th Collective Index (1972-1976). It
can be searched essentially the same way the
CHEMLINE file is searched. There is no
structure output in CHEMNAME.
The CHEMDEX system on SDC's
ORBIT system is based on all the chemicals
referred to (any number of times) in the CA
files since 1972. The file now only covers
1972 and 1973 and contains about 500,000
entries; however, SDC has indicated the file
is expected to contain in excess of 2 million
compounds. The file is searchable in the
same way as the previous two systems and,
in addition, is searchable with left truncation
capability. There is no structure output in
SANSS - MUCH DIFFERENT FILE
This CIS/SANSS system is based on a very different concept of file construction. SANSS is based on collecting and integrating lists of chemicals from "relevant" sources, primarily government files and files of numeric data. For example, at present SANSS has 41 lists of chemicals integrated into one master Unified Database of some 140,000 chemicals. Among the lists which comprise the 41 are the TSCA inventory, Merck Index, NIOSH RTECS, Mass Spectrometry, Oak Ridge Mutagen and Teratogen files, Carbon-13 NMR, X-ray data files (3), EPA Pesticides, NCI carcinogens, OSHA carcinogens, and CHEMLINE chemicals already in SANSS.
FIRST REGISTRY FOR OVER 35,000
As a result of using lists of chemicals, rather than specific bibliographic sources, a very interesting file was created. The primary unique feature that comes from this approach is that, of the 140,000 chemicals, over 35,000 were given CAS Registry Numbers for the first time. This means that there were no bibliographic literature citations in the CAS system at the time these chemicals were first registered for CIS. The reason for this is three-fold. Firstly, the CAS Registry started in 1965 and some of the chemicals in SANSS files are pre-1965. Secondly, some of the chemicals are either not in CAS-abstracted literature or are in government publications. Lastly, some of the chemicals in SANSS, namely those on the TSCA Inventory, were defined and registered for the government in support of a regulatory statute, and hence have never before been described in this particular manner in the literature.
While the chemicals that have been assigned CAS Registry numbers for the first time come from most SANSS files, some have been major contributors. These include, with their corresponding approximate number of new Registry numbers:
1. NIOSH RTECS (1400)
2. Mass Spec (7800)
3. NBS Single Crystal (4000)
4. Tokyo Thermodynamics (1000)
5. US International Trade Commission (1500)
6. PHS-149 List of Chemicals tested for carcinogenicity (700)
7. NIMH Psychotropic Drugs (300)
8. EMIC (300)
Of all the above cases of new registration, the most interesting is probably the ITC list of chemicals in import/export commerce. The total lack of information in the CAS literature from 1965 to date of registration (1977) of these chemicals could be used to put in priority a list of chemicals for further testing and studies. As it turns out, most of the chemicals are from the Colour Index or are surfactants.
In order to do a search in SANSS, one must first decide how to set up the query. SANSS has the distinction, as well as the problem, of having the most diverse set of search techniques available in any system of its kind. Searches can be performed in a number of ways, including:
1. Name search (NPROBE)
2. Structure Key search (PROBE)
3. Ring search (RPROBE)
4. Fragment search (FPROBE)
NAME SEARCH APPROACH MOST
As one can readily guess, the most used
technique is the NPROBE or name search
approach. To do a CIDS structure key
search requires the knowledge of the key
codes, for which there is a good, but lengthy
manual. To do a ring or fragment search
requires setting up a structure diagram using
a simple set of commands that "draws" the
chemical structure/picture at your terminal.
While the SANSS flexibility helps both the
inexperienced and the experienced user, it
does take about - 1 hour to get
accustomed to the system. Once a search is
performed, by any of the above (or
combinations of the above), the results
(actually the CAS Registry numbers) are
stored in a file, the same way as is done on
the ELHILL, ORBIT and DIALOG
systems. By doing this we are able to either
combine this list with another search list, or
use the list to look up data from other
files/computers using the CAS REGN as the
link between data/information systems.
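The idea of saving each search result as a list of CAS Registry Numbers, combining lists, and then using a list as a key into other files can be sketched as follows. This is a minimal illustration, not the SANSS implementation: the variable names, the choice of which numbers appear in which list, and the contents of `other_file` are all hypothetical. The four Registry Numbers used as sample values are those of benzene, toluene, phenol and aniline.

```python
# Hypothetical saved result lists; each is a set of CAS Registry Numbers.
# 71-43-2 benzene, 108-88-3 toluene, 108-95-2 phenol, 62-53-3 aniline.
name_hits = {"71-43-2", "108-88-3", "108-95-2"}      # e.g. from a name search
fragment_hits = {"71-43-2", "108-88-3", "62-53-3"}   # e.g. from a fragment search

# Combining two lists: substances found by both searches, or by either one.
in_both = name_hits & fragment_hits
in_either = name_hits | fragment_hits

# Stand-in for another data file keyed by CAS REGN (placeholder records only).
other_file = {"71-43-2": "record A", "62-53-3": "record B"}

# Use the combined list to pull records from the other file via the CAS REGN.
linked = {regn: other_file[regn] for regn in sorted(in_both) if regn in other_file}

print(sorted(in_both))
print(len(in_either))
print(linked)
```

Because every list holds nothing but Registry Numbers, the same lookup works against any file that is keyed by CAS REGN, whichever search technique produced the list.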
NOT ALL CHEMICALS IN FILE HAVE
It is important to note that not all the
chemicals in SANSS (or for that matter in
the CAS Registry system of over 4.5
million) have completely defined structures.
Such "unstructurable" chemicals, while well
defined in a chemical, as well as legal sense,
can be searched, but they are searchable only
in the name search part of the SANSS. Such
chemicals include substances of unknown or
variable composition, complex reaction
products and biological materials, the
so-called UVCB chemicals. Some of the
definitions of these chemicals run into
paragraphs, but each of the words (i.e.,
keywords) in the definition is fragmented
when the file is processed and is thus
searchable as a name fragment. Polymers are
also handled this same way, using name
searching only (i.e., the SANSS NPROBE
option). One last point of importance to
note is that, because of the non-structurable
chemicals in SANSS, there are more
chemicals that can be searched by name than
by structure in SANSS, namely 100% are
name searchable, while only some 90% are
AN EXAMPLE OF IDENTIFYING AN
In the examples below, the purpose is to
show how CIS can be used to identify an
unknown, given the facts that a liquid
chemical has been found which has a bitter
taste. The first thing done was to obtain a
mass spectrum of the unknown, and then go
to the CIS. As the example below shows,
the mass spectrum alone does not allow for
the identification, and after obtaining some
toxicity data (via RTECS) and the structures
of the two mass spec hits (via SANSS),
finally the OHM-TADS system is used to
distinguish between the two possible
answers. Lastly, the CIS-DIALOG link,
using the CAS REGN, is used to get some
literature references to the unknown.
(In the example shown below, the
underlined information is that which the user
enters; all other information is printed out by