The NIH/EPA Chemical Information System
STEPHEN R. HELLER
Environmental Protection Agency, PM-218
Washington, DC 20460
G. W. A. MILNE
National Institutes of Health
Bethesda, MD 20014
Over the past seven years, NIH and EPA have developed a computer based Chemical Information System (CIS), which is an online interactive computer system that handles chemical and toxicological data (1). The CIS consists mainly of a collection of numeric (as opposed to bibliographic) data bases and software to search these data bases. The four main areas of the CIS can be grouped as follows:
1. Searchable numeric data bases
2. Structure and Nomenclature Search system (SANSS)
3. Chemical Substance Information System (CSIS)
4. Analysis and Modeling Programs
The first three areas will be described, with emphasis on the linking of areas 1 and 2.
Figure 1 shows how the four areas of the CIS are coordinated, with the Structure and Nomenclature Search System (SANSS) in the center. At present there are 25 data bases in the SANSS. These comprise the CIS Unified Data Base (UDB) and are searchable by the SANSS (2). They are shown in Figure 2. The referral aspects of the CIS represent a valuable tool for scientific and administrative work both within our respective Agencies and outside them, in the public and private sectors, here in the USA and abroad. The referral capability of the CIS consists of a list of data bases, literature references (e.g., Merck Index) and Government Regulatory files, which can all be accessed simultaneously by consulting a single central file. All the available information concerning a substance can be located in a single operation. As the number of data bases increases, the CIS becomes a more valuable and time-saving tool in searches for chemical information. Typical questions that can be readily and inexpensively answered by this approach are:
* Has this chemical been sold as a pesticide in the USA?
* Is there a measured acute toxicity value for a particular air pollutant?
* Is information concerning a drug taken in overdose quantities and identified by gas chromatography-mass spectrometry in the Merck Index or the NIMH book on psychotropic drugs?
* Has a certain chemical been registered for sale in the USA?
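The single central file described above can be pictured as an index keyed by CAS Registry Number. The following is a minimal sketch in Python, not the CIS implementation: the file names are taken from Figure 2, while the registry numbers, local identifiers and function names are made-up placeholders.

```python
from collections import defaultdict

# Hypothetical referral records: (registry number, file name, local identifier).
# The registry numbers and local identifiers are placeholders, not real data.
RECORDS = [
    ("RN-0001", "MERCK INDEX", "M-100"),
    ("RN-0001", "NIMH-PSYCHOTROPIC DRUGS", "P-7"),
    ("RN-0001", "NIH/EPA-MSSS", "MS-4521"),
    ("RN-0002", "EPA-ACTIVE INGREDIENTS IN PESTICIDES", "AI-12"),
    ("RN-0002", "MERCK INDEX", "M-233"),
]

def build_referral_index(records):
    """Map each registry number to the files (and local ids) that mention it."""
    index = defaultdict(dict)
    for rn, file_name, local_id in records:
        index[rn][file_name] = local_id
    return index

def files_containing(index, rn):
    """Answer 'which files hold information on this substance?' in one lookup."""
    return sorted(index.get(rn, {}))

index = build_referral_index(RECORDS)
print(files_containing(index, "RN-0001"))
```

Keying every file on the same identifier is what lets one lookup replace a separate search of each collection.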
Figure 1. The structure of the CIS with the CAS Registry number linking (CIS components-August, 1978)
FILE NUMBER OF COMPOUNDS
NIH/EPA-MSSS 25,560
C-13 NMR 3,765
EPA-ACTIVE INGREDIENTS IN PESTICIDES 1,454
PESTICIDES STANDARDS 384
ORD-CHEMICAL PRODUCERS 375
OIL AND HAZARDOUS MATERIALS 858
AEROS/SAROAD 65
AEROS/SOTDAT 572
STORET 234
CHEMICAL SPILLS 577
TSCA INVENTORY CANDIDATE LIST 33,579
NIMH-PSYCHOTROPIC DRUGS 1,689
SRI-PHS LIST 149 OF CARCINOGENS 4,448
NBS-SINGLE CRYSTAL FILE 18,362
HEATS OF FORMATION OF GASEOUS IONS 3,169
GAS-PHASE PROTON AFFINITIES 454
NSF-RANN POLLUTANT FILE 225
FDA-PESTICIDE REFERENCE STANDARDS 613
CPSC-CHEMRIC MONOGRAPHS 1,000
CAMBRIDGE UNIVERSITY CRYSTAL DATA 10,018
EROICA THERMODYNAMIC DATA 4,492
MERCK INDEX 8,894
ITC-INTERNATIONAL TRADE COMMISSION 9,194
NIOSH-REGISTRY OF TOXIC EFFECTS OF CHEMICAL SUBSTANCES 19,908
NFPA-HAZARDOUS CHEMICALS 397
Figure 2. List of the 25 collections which currently comprise the CIS unified data base (integrated SANSS data base 3/1/78)
Among the data bases being added to the CIS this year are those shown in Figure 3. Over the next 2-3 years, with the continued addition of files that are either generated or used by the Government, it is expected that the list of referral files will grow to over 250. With the recent efforts of the four main Federal regulatory Agencies (EPA, FDA, CPSC, OSHA) to coordinate their various activities, such as the study and regulation of specific chemicals, this central referral system takes on more importance. This four-Agency group, known as the Interagency Regulatory Liaison Group (IRLG) (3), is now working to use the Chemical Abstracts Service (CAS) Registry Number as the standard chemical identifier for the chemicals in all the four Agencies. An internal regulation has been proposed which will make this mandatory. The regulation is modeled after EPA Order 2800.2, currently the only Government regulation to mandate standardized chemical classification (4).
Over the past four years, some 170,000 chemical names have been submitted to CAS, under
contract to EPA, to obtain the CAS Registry Numbers for these chemicals. The result of this
massive and costly effort is the CIS Unified Data Base (UDB) of about 101,000 unique
chemicals associated with the 25 files shown in Figure 2. That there is so much overlap of the
chemicals found in these files is not surprising. It is beginning to appear that there are relatively
few chemicals which are actually studied in any detail, and even fewer that become significant in
commerce, as, for example, drugs, food additives or pesticides. Projections suggest that by the
time the CAS registration process of some 250 files is completed, the actual size of the CIS
Unified Data Base will not exceed 175,000-200,000 substances. The need then will be to obtain
as much useful and accurate information about these substances as is necessary to protect health
and environment in the USA, as is required by the missions of our respective Agencies. It is our
hope that by defining the size or scope of the "real" universe of chemicals, the burden on
industry will be lessened and that future efforts will be easier to direct. Thus, we see little
immediate need to study the universe that CAS has defined, of some 4,000,000 chemicals
found in the literature that CAS has abstracted since 1965. Only about 12% of these four million
have appeared more than once in the CAS abstracted literature and probably no more than 3%
are produced and sold in anything but research quantities.
US Coastguard Chemical Properties File.
EPA IERL Non-Criteria Pollutant Emissions.
EPA, Section 111A of the Clean Air Act.
EPA, Office of Air Quality, Permissible Standards, Criteria Pollutants.
EPA, Office of Water Supply, File of Drinking Water Pollutants.
EPA, Pollutant Strategies Branch, Selected Organic Air Pollutants.
EPA, Effluent Guidelines Consent Decree List.
EPA, Section 112 of the Clean Air Act.
EPA, ORD, Gulf Breeze, List of Chemicals.
EPA, OTS Status Asses.
EPA, Standing Air Monitoring Work Group List of Non-Criteria Pollutants.
EPA, ORD-OHEE Laboratory Chemicals.
EPA, List of Hazardous Pesticides.
CITT, List of Candidates.
EPA, List of Potentially Hazardous Chemicals from Coal and Oil.
California OSHA List of Chemical Contaminants.
WHO, Food and Agriculture Organization, List of Pesticides.
EPA, IERL, Organic Chemicals in Air.
NCI, Public List of Known Carcinogens.
NCTR, Potential Industrial Carcinogens and Mutagens.
EPA, IERL, List of Environmental Carcinogens.
EPA, OPP, Pesticide Literature Searches.
NIEHS, Laboratory Chemicals.
Toxic and Hazardous Industrial Chemicals Safety Manual. International Technical Information Institute, Tokyo.
List of Teratogenic Chemicals. Medical Information Center, Karolinska Institute, Stockholm.
EPA, Mutagenicity Studies.
EPA, TSCA Section 8e, List of Chemicals.
Figure 3. New files being added to the NIH/EPA CIS UDB in Spring, 1978
Structure and Nomenclature Search System (SANSS)
The Structure and Nomenclature Search System (SANSS), the heart of the CIS, is based upon the work of Feldmann, who developed the original search algorithms a number of years ago (5). Over the last few years, a nomenclature search program, an identity search program and a search program based on the Edgewood CIDS structure keys (6) have been added, and the system has been considerably refined. The SANSS and its data base, which consists of connection tables from CAS and chemical names, have absorbed the bulk of the CIS budget.
Currently, the SANSS can be used in a number of ways. The more important methods are:
* Nomenclature Search (NPROBE)
* Ring Search (RPROBE)
* Fragment Search (FPROBE)
* CIDS Code Search (SPROBE)
* Molecular Weight Search (MW)
* Molecular Formula Search (MF)
* Substructure Search (SUBSS)
* Full Structure Search (IDENT)
In addition to these searching programs, there are a number of retrieval and display options available in the system. These include:
* Display of Chemical Structure
* Display of CAS Collective Index names
* Display of synonyms, common names and trade names
* Display of molecular formulas
* Display of files containing a substance
* Retrieval based upon CAS Registry Number
The following sections will be devoted to explaining the various SANSS modules and giving
examples of how they can be used. At the end of the chapter an example of the interfacing of the
SANSS with the NIOSH RTECS data base of acute toxicity data (7) will be described, as an
example of the direction that CIS development is taking. Since there is considerable interest on
the part of the chemical industry in the implementation of TSCA, access to the bulk of the public
data that EPA will be using in its work for administering TSCA should be of value. At present,
development of the SANSS is being directed towards the immediate needs of EPA's Office of
Toxic Substances (OTS), so that the foundation that has been built for the SANSS can be used
most effectively for the implementation of TSCA.
Name - Nomenclature Search (NPROBE)
The name search, NPROBE, has been implemented as a result of requests expressed by both the SANSS user community and the CEQ-TSCA MITRE study proposal (8) for the development of a Chemical Structure and Nomenclature System, which we have called the Structure and Nomenclature Search System. The software used is similar to that used in the CHEMLINE system at the National Library of Medicine (NLM) and allows for complete or partial (fragment) name search. There is an average of slightly over 3 names per chemical in the CIS UDB, as opposed to slightly more than 2 names per chemical in CHEMLINE (9). The CHEMLINE file, which links primarily to the TOXLINE literature references, is made up mostly of research chemicals, and thus is not likely to have the multiple synonyms that are associated with commercial chemicals. In the CIS UDB, which is composed of files from primarily regulatory, and hence commercial, sources, there are the expected additional names associated with materials in commerce.
To conduct a nomenclature search, the user simply enters a chemical name or name fragment,
as shown in Figure 4. The example shown in Figure 4 is of a search for any substance in the
UDB whose name contains the fragment "DDT." From Figure 4 it can be seen that there are 12
such substances in the UDB, of which the first, p,p' DDT, is shown in the Figure. Also
shown in this figure are all the files of the UDB which contain information on p,p' DDT,
with the local file identifier numbers listed so that one may go directly to the particular file and
get the information that is contained in that file regarding p,p' DDT. In Figure 5, a name search
for the name fragment "LSD" was performed on the entire UDB and five examples were found.
The first of these five is shown in Figure 5, with the names of the files that have information
about LSD. Not surprisingly, the files include the NIMH List of Psychotropic Drugs, the Merck
Index and the NIOSH acute toxicity data base, as well as the NIH/EPA Mass Spectral Data Base
and the TSCA Candidate List. There is little doubt that the inclusion on the TSCA Candidate or
"Strawman" list will be changed once the final TSCA inventory is published, since under present
law, LSD is an illegal chemical substance. This is a useful search technique, but requires a large
list of synonyms, a correct spelling, and a knowledge of how chemical names are broken down.
For example, in searching for a cyclohexanedione, if the name of the substance in the file is written as
2, 5-dione, a search for "dione" will not find the chemical.
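The dependence of a name search on how names are broken down can be illustrated with a small sketch. It is written in Python under stated assumptions and is not the NPROBE code: names are assumed to be split into fragments on blanks only, the registry numbers are placeholders, and the dione name is an illustrative string rather than a real file entry.

```python
# Hypothetical name file: placeholder registry numbers mapped to stored names.
NAMES = {
    "RN-0001": ["p,p' DDT", "DDT"],
    "RN-0003": ["CYCLOHEXANE-2, 5-DIONE"],  # illustrative name only
}

def fragments(name):
    """Split a stored name into fragments on blanks only (an assumption)."""
    return name.upper().split()

def build_fragment_index(names):
    """Inverted index: name fragment -> set of registry numbers."""
    index = {}
    for rn, synonyms in names.items():
        for name in synonyms:
            for frag in fragments(name):
                index.setdefault(frag, set()).add(rn)
    return index

def probe_exact(index, query):
    """Look the query up as a whole fragment; fast, but sensitive to splitting."""
    return sorted(index.get(query.upper(), set()))

def probe_substring(names, query):
    """Scan every stored name for the query text; slower, but more forgiving."""
    q = query.upper()
    return sorted(rn for rn, syns in names.items()
                  if any(q in s.upper() for s in syns))

FRAGMENT_INDEX = build_fragment_index(NAMES)
print(probe_exact(FRAGMENT_INDEX, "DDT"))
print(probe_exact(FRAGMENT_INDEX, "dione"))
print(probe_substring(NAMES, "dione"))
```

In this sketch the stored fragment is "5-DIONE", so the whole-fragment lookup for "dione" misses the substance while the slower scan finds it.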
Functional Group - CIDS Key Search (SPROBE)
The best way to search for functional groups or structure features in the CIS SANSS is to use the Chemical Information Data System (CIDS) keys, developed by Edgewood Arsenal. The CIDS keys, a few of which are shown in Figure 6, are the basis of a rapid and efficient way to search the CIS UDB for substances containing a particular functional group or structure feature. Many of the CIDS keys are quite specific in nature, as can be seen in Figure 6. Others, shown towards the bottom of Figure 6, are quite generic in nature. For example, the CIDS key FG25 refers to the presence of a nitrile or cyanide group in the molecule.
An example of a CIDS key search is given in Figure 7, where a search is shown for all
cyclohexyl (SCN49) morpholine (SCN35) compounds in the NIOSH RTECS data base of acute
toxicity. There are only two such compounds in the data base, and the first of these is printed out
in the figure, along with its local NIOSH RTECS identifier numbers.
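A key search of this kind amounts to intersecting the lists of compounds filed under each key. The sketch below, in Python, assumes an inverted index from CIDS key to registry numbers; the key codes are those quoted in the text, but the registry numbers and the function name are hypothetical.

```python
# Hypothetical inverted index: CIDS key -> registry numbers carrying that key.
# Key meanings are from the text; the registry numbers are placeholders.
KEY_INDEX = {
    "SCN49": {"RN-0010", "RN-0011", "RN-0012"},  # cyclohexyl
    "SCN35": {"RN-0011", "RN-0012", "RN-0013"},  # morpholine
    "FG25": {"RN-0013", "RN-0014"},              # nitrile or cyanide
}

def key_search(key_index, *keys):
    """Return the compounds that carry every one of the requested keys."""
    if not keys:
        return []
    hit_lists = [key_index.get(key, set()) for key in keys]
    return sorted(set.intersection(*hit_lists))

print(key_search(KEY_INDEX, "SCN49", "SCN35"))
```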
Figure 4. NPROBE name search for name fragment "DDT"
Figure 5. NPROBE name search for LSD
Figure 6. Sample CIDS key codes
Figure 7. CIDS key search for cyclohexyl morpholine compounds
Molecular Weight (MW) and Formula (MF) Search
In addition to searching for a particular functional group using the CIDS keys as shown above, it is possible to search for a compound, or a group of compounds, using molecular weight. The molecular weight search, shown in Figure 8, allows for either a specific molecular weight, or, as is indicated in the figure, a range of molecular weights. In the particular example shown in Figure 8, the Merck Index is being searched for all occurrences of compounds with a molecular weight between 368 and 380. There are 167 such substances, as can be seen in the top part of Figure 8. This is too large a number and so it was decided to try to narrow or filter the search down to a smaller number using a molecular formula search. In this case what was really sought were all compounds which have two oxygen atoms and a molecular weight between 368 and 380. In Figure 8 a search for this partial formula (O2) is shown, and this is followed by a Boolean AND logic operation (INTERsect) between the file of 167 compounds with the correct molecular weight range and the file of 1484 having the correct partial formula. The result of this AND operation is a file containing the 16 compounds in the Merck Index which have a molecular weight between 368 and 380 as well as exactly two oxygen atoms in the molecule. At the bottom of Figure 8, the first of the 16 answers is printed out. This compound, with a molecular formula of C21.H23.Cl.F.N.O2 and a molecular weight of 375, is Haloperidol, which is a drug used as a sedative and tranquilizer.
In the event that there is no interest in chlorinated compounds, even though they may meet
the molecular weight and molecular formula criteria, a further molecular formula search may be
conducted, as shown in Figure 9, for compounds with 1-4 chlorine atoms. From Figure 9, it can
be seen that there are 986 compounds with 1-4 chlorine atoms in the Merck Index file. Since the
requirement was for compounds that did not contain this halogen atom, a Boolean NOT
operation between the 986 chlorine-containing compounds and the 16 compounds previously
found is performed, as seen in the center of Figure 9. This results in the removal of three of the
sixteen substances, and of the remaining thirteen, the first one, Androsta-3,5-dien-17-ol, 3-(cyclopentyloxy)-17-methyl-, (17.beta.), is printed out and shown at the bottom of Figure 9.
This, of course, like the other twelve in the file, does not contain the chlorine that was present in
three of the answers to the first search shown in Figure 8. The ability to interact and impose
various limitations and filters on searching is a very powerful capability of the SANSS.
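The sequence of molecular weight search, partial formula search, AND and NOT can be sketched with ordinary set operations. This is a minimal illustration in Python, not the SANSS code: the four-compound file, the registry numbers and the function names are assumptions, and only the first formula (that quoted above for Haloperidol) comes from the text.

```python
# Approximate standard atomic masses (g/mol).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007,
               "O": 15.999, "F": 18.998, "Cl": 35.45}

# Hypothetical file: placeholder registry number -> molecular formula.
FILE = {
    "RN-A": {"C": 21, "H": 23, "Cl": 1, "F": 1, "N": 1, "O": 2},  # formula from the text
    "RN-B": {"C": 25, "H": 38, "O": 2},  # placeholder
    "RN-C": {"C": 10, "H": 12, "O": 2},  # placeholder, too light
    "RN-D": {"C": 24, "H": 34, "O": 3},  # placeholder, three oxygens
}

def molecular_weight(formula):
    """Sum the atomic masses of all atoms in the formula."""
    return sum(ATOMIC_MASS[element] * n for element, n in formula.items())

def mw_search(file, low, high):
    """Compounds whose molecular weight lies in [low, high]."""
    return {rn for rn, f in file.items() if low <= molecular_weight(f) <= high}

def mf_search(file, element, low, high):
    """Compounds containing between low and high atoms of one element."""
    return {rn for rn, f in file.items() if low <= f.get(element, 0) <= high}

in_range = mw_search(FILE, 368, 380)          # molecular weight range
two_oxygens = mf_search(FILE, "O", 2, 2)      # partial formula O2
both = in_range & two_oxygens                 # Boolean AND (intersection)
chlorinated = mf_search(FILE, "Cl", 1, 4)     # 1-4 chlorine atoms
no_chlorine = both - chlorinated              # Boolean NOT (difference)
print(sorted(both), sorted(no_chlorine))
```

Each search returns a set of identifiers, so filters can be applied in any order and at little cost once the individual lists exist.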
Figure 8. Molecular-weight range search
Figure 9. Sample of combination searches of MF, MW with NOT logic
Nucleus - Ring Search (RPROBE)
One of the features of the CIS SANSS that has made the system useful is the structure of the
file with respect to ring systems. The SANSS has a hierarchical file structure that allows for rapid
and inexpensive searching for specific rings or ring systems. In Figure 10, a list of some of the
commands used to generate structures is given. To show how the SANSS works and how one can
use the various query modules, the remainder of the chapter will be devoted to searching through
the NIOSH RTECS data base for chemicals having an aromatic ring, substituted on ortho carbons
with chlorine and bromine respectively. The first thing that must be done in order to perform
such a search is to build the "query" structure that is to be sought. This is done with the first few
commands shown in Figure 11. The query structure in Figure 11 is a chloro bromo (ortho)
substituted benzene ring, but the ring probe search will be conducted for any ortho disubstituted
aromatic ring, since it does not take into account the nature of the substituents. Also, since other
substituents on the benzene ring will be permitted, it is necessary to reset the substituent search
level from "EXACT" (only two substituents, and these must be ortho) to "EMBED" (there must
be two ortho substituents at a minimum). The command to do this is EXIM, which is short for
EXACT/EMBED switch. The search shown in Figure 11 reveals that there are 2715 compounds in
the NIOSH RTECS file that contain at least this ring pattern. To filter such potentially broad
responses further, one can use CIDS keys searches and other such constraints as shown below.
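The difference between the EXACT and EMBED substituent levels can be illustrated as follows. The sketch, in Python, is one possible reading of the description above and not the RPROBE algorithm: a six-membered ring is reduced to the set of positions (0-5) that carry a substituent of any kind, and the function names are hypothetical.

```python
def ring_images(positions, size=6):
    """All rotations and reflections of a substitution pattern on a ring."""
    for shift in range(size):
        for sign in (1, -1):
            yield frozenset((sign * p + shift) % size for p in positions)

def ring_match(query, target, mode="EXACT", size=6):
    """EXACT: same pattern and nothing more. EMBED: pattern present at a minimum."""
    target = frozenset(target)
    for image in ring_images(query, size):
        if mode == "EXACT" and image == target:
            return True
        if mode == "EMBED" and image <= target:
            return True
    return False

ORTHO = {0, 1}  # two substituents on adjacent ring atoms

print(ring_match(ORTHO, {2, 3}, "EXACT"))      # ortho disubstituted ring
print(ring_match(ORTHO, {0, 1, 3}, "EXACT"))   # a third substituent is present
print(ring_match(ORTHO, {0, 1, 3}, "EMBED"))   # extra substituents are allowed
print(ring_match(ORTHO, {0, 2}, "EMBED"))      # meta pattern only
```

Because the nature of the substituents is ignored at this stage, the EMBED level is expected to return a broad list that later searches must narrow.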
Fragment Search (FPROBE)
One feature necessary to any structure search system is the ability to search for atom-centered fragments. In a fragment search the user must specify an atom and its neighbors. The exact (or generic) nature of the bonds between this central atom and each of its neighbors is then entered and a search is conducted for all occurrences of such a fragment. If a query structure has already been generated, as was done in Figure 11, that structure can be used by the SANSS program to generate and search for fragments. There are usually a number of atoms in a query structure that can be considered as central to a fragment. Hence, a request for a fragment probe of the substructure shown in Figure 11 would lead to searches for six fragments, four of which would be the same (i.e. atom centered fragments about atoms 3, 4, 5 and 6 are all the same, representing a carbon atom in an aromatic ring attached to two other aromatic carbon atoms in the ring and a hydrogen.) Such fragments are not very specific, and so it is best to identify the atom centered fragment for which one wishes to search. In Figure 12, atom number 1 is selected and a search for all occurrences of a chlorine atom on an aromatic ring is performed. The result of this search is a file containing all 1618 compounds in the NIOSH RTECS file that contain this particular structure fragment.
After the fragment search is conducted for the chloro aromatic fragment, a similar search is
performed on the fragment centered about atom 2, which contains a bromo substituent. This
fragment probe (FPROBE) search, shown in Figure 13, results in 229 occurrences of this
fragment in compounds in the NIOSH RTECS data base.
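The generation of atom-centered fragments from a query structure can be sketched from a simple connection table. The following Python example is an illustration under stated assumptions, not the FPROBE code: ring atoms are numbered 1-6 as in the discussion above, while numbering the chlorine and bromine as atoms 7 and 8, the bond labels and the treatment of implicit hydrogen are assumptions of the sketch.

```python
# Connection table for the query structure: a benzene ring (atoms 1-6)
# with chlorine on atom 1 and bromine on atom 2. Atoms 7 and 8 are
# numbered here by assumption.
ATOMS = {1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "Cl", 8: "Br"}
BONDS = [(1, 2, "ar"), (2, 3, "ar"), (3, 4, "ar"), (4, 5, "ar"),
         (5, 6, "ar"), (6, 1, "ar"), (1, 7, "single"), (2, 8, "single")]

def neighbors(atom):
    """Yield (neighbor, bond type) pairs for one atom of the connection table."""
    for a, b, kind in BONDS:
        if a == atom:
            yield b, kind
        elif b == atom:
            yield a, kind

def fragment(center):
    """Describe an atom by its element and its sorted (bond, neighbor) list."""
    nbrs = [(kind, ATOMS[other]) for other, kind in neighbors(center)]
    if ATOMS[center] == "C":
        # An aromatic carbon has three neighbors; fill the gap with hydrogen.
        nbrs += [("single", "H")] * (3 - len(nbrs))
    return ATOMS[center], tuple(sorted(nbrs))

ring_fragments = {atom: fragment(atom) for atom in range(1, 7)}
distinct = set(ring_fragments.values())
print(len(ring_fragments), len(distinct))
```

Six ring-centered fragments are generated but only three are distinct, which is why selecting the chlorine-bearing or bromine-bearing atom gives a far more specific search than the four identical ring CH fragments.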
Figure 10. Commands used to generate structures for searching
Figure 11. A ring-probe (RPROBE) search for a disubstituted benzene
Figure 12. A fragment probe (FPROBE) for a chlorine atom attached to an aromatic carbon atom
Figure 13. A fragment probe (FPROBE) for a bromine atom attached to an aromatic carbon atom
Figure 14. Intersection and substructure search of files derived in Figures 11-13
Figure 15. One of seven substructure search hits
Figure 16. Example of IDENT search for a complete molecule
Figure 17. Example of NIOSH RTECS toxicity data retrieval
Substructure Search (SUBSS)
The Substructure Search option is an atom-by-atom, bond-by-bond comparison between
connection tables in the data base and the connection tables corresponding to the query structure.
This time-consuming, sequential search is quite costly, and so the ring probe, fragment probe,
and other search techniques described above are used as screens to speed up the process and
reduce the cost. Following the three separate searches done in Figures 11-13, the next step is to
see which compounds in the NIOSH RTECS data base contain occurrences of all three. This is
done by a simple Boolean AND logic combination of the three lists of Registry Numbers
generated by the searches in these Figures. The intersection of the lists, performed by the INTER
command as shown in Figure 14, results in 12 compounds meeting the criteria of all three
searches. However, not necessarily all of the 12 answers are precisely what is wanted. This is
because the three searches in Figures 11-13 are for "pieces" of the structure sought but the
searches do not require these pieces to be in the same juxtaposition as in the query structure.
That is, the three requirements comprise a necessary, but not sufficient, condition for an answer
to the original question. To secure an exact answer as to how many (if any) of these 12
compounds meet the exact query structure, it is necessary to perform a true substructure search
(SUBSS) as is shown in Figure 14. The result of the use of SUBSS shows that only 7 of the 12
"answers" from the intersection of the three searches do have the bromine and chlorine ortho to
one another on the benzene ring. Of the 7 answers, one is shown in Figure 15. As it turns out
from inspection of all 12 prior answers (not shown here), the other compounds retrieved are meta
substituted chloro bromo aromatic compounds.
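Why the screens are necessary but not sufficient can be shown with a small atom-by-atom matcher. The Python sketch below uses a plain backtracking search, which is an assumption of the illustration and not necessarily the algorithm used by SUBSS; the molecules are hand-built connection tables (heavy atoms only) and all names are hypothetical.

```python
def make_mol(atoms, bonds):
    """Build (atoms, adjacency) from an element table and a bond list."""
    adj = {i: {} for i in atoms}
    for a, b, kind in bonds:
        adj[a][b] = kind
        adj[b][a] = kind
    return atoms, adj

RING_ATOMS = {i: "C" for i in range(1, 7)}
RING = [(1, 2, "ar"), (2, 3, "ar"), (3, 4, "ar"),
        (4, 5, "ar"), (5, 6, "ar"), (6, 1, "ar")]

QUERY = make_mol({**RING_ATOMS, 7: "Cl", 8: "Br"},
                 RING + [(1, 7, "single"), (2, 8, "single")])
# Ortho target with one extra substituent (a carbon on ring atom 4).
ORTHO = make_mol({**RING_ATOMS, 7: "Cl", 8: "Br", 9: "C"},
                 RING + [(1, 7, "single"), (2, 8, "single"), (4, 9, "single")])
META = make_mol({**RING_ATOMS, 7: "Cl", 8: "Br"},
                RING + [(1, 7, "single"), (3, 8, "single")])

def has_aromatic_halogen(mol, halogen):
    """Screen: is the halogen attached to a carbon that has an aromatic bond?"""
    atoms, adj = mol
    return any(atoms[a] == halogen and
               any(atoms[n] == "C" and "ar" in adj[n].values() for n in adj[a])
               for a in atoms)

def passes_screens(mol):
    return has_aromatic_halogen(mol, "Cl") and has_aromatic_halogen(mol, "Br")

def substructure_match(query, target):
    """Atom-by-atom, bond-by-bond search for the query inside the target."""
    q_atoms, q_adj = query
    t_atoms, t_adj = target
    order = list(q_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        q = order[len(mapping)]
        for t in t_atoms:
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Every query bond to an already-mapped atom must exist in the target.
            if all(t_adj[t].get(mapping[n]) == kind
                   for n, kind in q_adj[q].items() if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})

for name, mol in (("ortho", ORTHO), ("meta", META)):
    print(name, passes_screens(mol), substructure_match(QUERY, mol))
```

Both targets pass the inexpensive screens, but only the ortho compound survives the full comparison, so the costly step is run on the short list alone.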
Complete Structure Search (IDENT)
The final SANSS module to be described in this chapter is the search for a total or full
structure, rather than a substructure. This module was designed primarily for the purpose of
searching for and reporting specific chemicals as part of the TSCA inventory reporting
procedures. The full structure search, called IDENT (for IDENTity), has and will continue to
have specific application to TSCA activities. For example, after the final "grandfather"
inventory required under section 8 of the Act is published and made available, via the CIS, as
well as by other means, it will be necessary for potential vendors of a chemical to determine if
the chemical they wish to sell or manufacture is in the Inventory and can thus be produced and
marketed without extensive pre-manufacturing testing. Use of the IDENT search will quickly
reveal if the chemical is in the TSCA inventory. Of course, one can use the name search
capabilities, but there is no guarantee that the name used by the manufacturer will be in the list of
synonyms associated with the inventory. The structure shown in Figure 16 was generated using
the standard SANSS structure generation commands, such as those listed in Figure 10. The
IDENT search was then invoked and after being told that the structure had the normal number of
hydrogen atoms, consistent with normal valence, it found the structure in the CIS UDB. The
structure was then printed out, with all the local file identifier information, as well as a number of
synonyms, one of which is the TSCA Clerical Code Designation number for the substance.
SANSS-Data Base Interfaces
A structure or a nomenclature search is generally only a means to an end. The end is often
some data associated with the structures found. In order to facilitate retrieval of such
information, an interface between the CIS numeric data bases and the SANSS has been
constructed. This allows for a search through the UDB followed by a data search (or retrieval)
and permits one to answer such queries as:
* Do any ortho bromo-chloro aromatic compounds have a toxicity greater than 1.0 mg/kg?
In the example shown in Figure 17, the first three answers from the previous search are used
to retrieve the toxicity data associated with these compounds. The automatic interface between
the systems is invoked by the command TSHOW and then the previous file of 7 CAS Registry
Numbers, generated by SUBSS, is specified, with only the first three being printed out upon
request.
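The interface step, in which a list of registry numbers from a structure search is used to pull records from a numeric data base, can be sketched as a keyed lookup. The Python example below is illustrative only: the registry numbers, the toxicity values and the function name are invented placeholders and do not reproduce RTECS data or the TSHOW command.

```python
# Placeholder hit list, standing in for the output of a structure search.
HITS = ["RN-0101", "RN-0102", "RN-0103", "RN-0104"]

# Placeholder toxicity records: registry number -> (route, species, mg/kg).
TOXICITY = {
    "RN-0101": [("oral", "rat", 250.0)],
    "RN-0102": [("oral", "mouse", 0.5), ("ipr", "mouse", 3.0)],
    "RN-0104": [("oral", "rat", 40.0)],
}

def show_toxicity(hits, toxicity, limit=None):
    """Return the data rows for the first 'limit' hits (all hits if None)."""
    rows = []
    for rn in hits[:limit]:
        for route, species, value in toxicity.get(rn, []):
            rows.append((rn, route, species, value))
    return rows

first_three = show_toxicity(HITS, TOXICITY, limit=3)
above_threshold = [row for row in first_three if row[3] > 1.0]
for row in first_three:
    print(row)
```

Because both systems share the registry number as a key, no re-entry of names or structures is needed between the structure search and the data retrieval.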
Summary
The NIH/EPA CIS has developed to the point where complex questions can be readily
answered. The ability to manipulate structure and numeric data and establish correlations
between the two should be of considerable value to the EPA in its work under the Toxic
Substances Control Act, as well as to scientists in general. The value of the SANSS linked to
CNMR data has been recently shown (10), and no doubt other structure-data studies will be
undertaken now that the necessary groundwork has been laid.
Acknowledgments
The authors wish to thank the following for their help and cooperation in developing the CIS SANSS: R. J. Feldmann, W. Greenstreet, M. Yaguda, M. Bracken, A. Fein, G. Marquart, and J. Miller.
Literature Cited
1. Heller, S. R., Milne, G. W. A., and Feldmann, R. J., Science, (1977), 195, 253.
2. Feldmann, R. J., Milne, G. W. A., Heller, S. R., Fein, A., Miller, J. A., and Koch, B., J. Chem. Info. and Comp. Sci., (1977), 17, 157.
3. The Interagency Regulatory Liaison Group (IRLG) was established August 2, 1977, by the following four Agencies: EPA, FDA, OSHA and CPSC.
4. EPA Order #2800.2, issued May 27, 1975.
5. Feldmann, R. J., and Heller, S. R., J. Chem. Doc., (1972), 12, 48.
6. CIDS Structure Feature Key Code Manual is available from CIS Project, Chemistry Department, Brookhaven National Laboratory, Upton, Long Island, New York 11973.
7. NIOSH, Registry of Toxic Effects of Chemical Substances (RTECS), 1977. Available from the US Government Printing Office, GPO Order Number 017-033-0027101; $17.50 per copy USA: $21.88 per copy non-USA.
8. Bracken, M., Dorigan, J., Hushon, J., and Overbey, II, J., MITRE Reprint MIR-7578 to CEQ, June 1977. Two volumes entitled "Chemical Substances Information Network (CSIN)."
9. NLM Fact Sheet for the Toxicology Information Program, January 1978.
10. Milne, G. W. A., Zupan, J., Heller, S. R., and Miller, J. A., Anal. Chim. Acta, In press (1978).
Received August 29, 1978.