The NIH/EPA Chemical Information System

STEPHEN R. HELLER

Environmental Protection Agency, PM-218

Washington, DC 20460

G. W. A. MILNE

National Institutes of Health

Bethesda, MD 20014



Over the past seven years, NIH and EPA have developed a computer-based Chemical Information System (CIS), an online interactive computer system that handles chemical and toxicological data (1). The CIS consists mainly of a collection of numeric (as opposed to bibliographic) data bases and software to search these data bases. The four main areas of the CIS can be grouped as follows:

1. Searchable numeric data bases

2. Structure and Nomenclature Search system (SANSS)

3. Chemical Substance Information System (CSIS)

4. Analysis and Modeling Programs

The first three areas will be described, with emphasis on the linking of areas 1 and 2.

Figure 1 shows how the four areas of the CIS are coordinated, with the Structure and Nomenclature Search System (SANSS) in the center. At present there are 25 data bases in the SANSS. These comprise the CIS Unified Data Base (UDB) and are searchable by the SANSS (2). They are shown in Figure 2. The referral aspects of the CIS represent a valuable tool for scientific and administrative work both within our respective Agencies and outside them, in the public and private sectors, here in the USA and abroad. The referral capability of the CIS consists of a list of data bases, literature references (e.g., Merck Index) and Government regulatory files, which can all be accessed simultaneously by consulting a single central file. All the available information concerning a substance can thus be located in a single operation. As the number of data bases increases, the CIS becomes a more valuable and time-saving device in searches for chemical information. Typical questions that can be readily and inexpensively answered by this approach are:

* Has this chemical been sold as a pesticide in the USA?

* Is there a measured acute toxicity value for a particular air pollutant?

* Is information concerning a drug taken in overdose quantities and identified by gas chromatography-mass spectrometry in the Merck Index or the NIMH book on psychotropic drugs?

* Has a certain chemical been registered for sale in the USA?
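The referral idea behind these questions can be sketched as a lookup in one central index keyed by registry number. The following is a minimal illustration in Python, not the CIS implementation: the registry-number placeholders and the file assignments are invented for the example, and only the file names are borrowed (some abbreviated) from Figure 2.

```python
# Central referral index: registry number -> files that hold data on it.
# The keys are placeholders, not real CAS Registry Numbers, and the
# file assignments are invented for illustration.
REFERRAL_INDEX = {
    "RN-0001": {"EPA-ACTIVE INGREDIENTS IN PESTICIDES", "MERCK INDEX",
                "NIOSH-RTECS"},
    "RN-0002": {"NIH/EPA-MSSS", "NIMH-PSYCHOTROPIC DRUGS", "MERCK INDEX"},
    "RN-0003": {"AEROS/SAROAD"},
}

def files_for(registry_number):
    """Return every file that mentions the substance, in one operation."""
    return sorted(REFERRAL_INDEX.get(registry_number, set()))

def is_in_file(registry_number, file_name):
    """Answer a yes/no referral question for a single file."""
    return file_name in REFERRAL_INDEX.get(registry_number, set())

print(files_for("RN-0002"))
print(is_in_file("RN-0001", "EPA-ACTIVE INGREDIENTS IN PESTICIDES"))
```

A question such as "has this chemical been sold as a pesticide?" then reduces to a membership test against the relevant file, provided every file identifies its substances by the same registry number.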

Figure 1. The structure of the CIS with the CAS Registry number linking (CIS components-August, 1978)












FILE                                                     NUMBER OF COMPOUNDS

NIH/EPA-MSSS                                             25,560
C-13 NMR                                                  3,765
EPA-ACTIVE INGREDIENTS IN PESTICIDES                      1,454
PESTICIDES STANDARDS                                        384
ORD-CHEMICAL PRODUCERS                                      375
OIL AND HAZARDOUS MATERIALS                                 858
AEROS/SAROAD                                                 65
AEROS/SOTDAT                                                572
STORET                                                      234
CHEMICAL SPILLS                                             577
TSCA INVENTORY CANDIDATE LIST                            33,579
NIMH-PSYCHOTROPIC DRUGS                                   1,689
SRI-PHS LIST 149 OF CARCINOGENS                           4,448
NBS-SINGLE CRYSTAL FILE                                  18,362
HEATS OF FORMATION OF GASEOUS IONS                        3,169
GAS-PHASE PROTON AFFINITIES                                 454
NSF-RANN POLLUTANT FILE                                     225
FDA-PESTICIDE REFERENCE STANDARDS                           613
CPSC-CHEMRIC MONOGRAPHS                                   1,000
CAMBRIDGE UNIVERSITY CRYSTAL DATA                        10,018
EROICA THERMODYNAMIC DATA                                 4,492
MERCK INDEX                                               8,894
ITC-INTERNATIONAL TRADE COMMISSION                        9,194
NIOSH-REGISTRY OF TOXIC EFFECTS OF CHEMICAL SUBSTANCES   19,908
NFPA-HAZARDOUS CHEMICALS                                    397

Figure 2. List of the 25 collections which currently comprise the CIS unified data base (integrated SANSS data base 3/1/78)

Among the data bases being added to the CIS this year are those shown in Figure 3. Over the next 2-3 years, with the continued addition of files that are either generated or used by the Government, it is expected that the list of referral files will grow to over 250. With the recent efforts of the four main Federal regulatory Agencies (EPA, FDA, CPSC, OSHA) to coordinate their various activities, such as the study and regulation of specific chemicals, this central referral system takes on more importance. This four-Agency group, known as the Interagency Regulatory Liaison Group (IRLG) (3), is now working to use the Chemical Abstracts Service (CAS) Registry Number as the standard chemical identifier for chemicals in all four Agencies. An internal regulation has been proposed which will make this mandatory. The regulation is modeled after EPA Order 2800.2, currently the only Government regulation to mandate standardized chemical classification (4).

Over the past four years, some 170,000 chemical names have been submitted to CAS, under contract to EPA, to obtain the CAS Registry Numbers for these chemicals. The result of this massive and costly effort is the CIS Unified Data Base (UDB) of about 101,000 unique chemicals associated with the 25 files shown in Figure 2. That there is so much overlap of the chemicals found in these files is not surprising. It is beginning to appear that there are relatively few chemicals which are actually studied in any detail, and even fewer that become significant in commerce, as, for example, drugs, food additives or pesticides. Projections suggest that by the time the CAS registration process of some 250 files is completed, the actual size of the CIS Unified Data Base will not exceed 175,000-200,000 substances. The need then will be to obtain as much useful and accurate information about these substances as is necessary to protect health and environment in the USA, as is required by the missions of our respective Agencies. It is our hope that by defining the size or scope of the "real" universe of chemicals, the burden on industry will be lessened and future efforts will be easier to direct. Thus, we see little immediate need to study the universe that CAS has defined, of over 4,000,000 chemicals found in the literature that CAS has abstracted since 1965. Only about 12% of these four million have appeared more than once in the CAS abstracted literature and probably no more than 3% are produced and sold in anything but research quantities.
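The orders of magnitude in this argument follow from simple arithmetic on the figures quoted above. The short check below uses only those figures; the variable names are ours.

```python
# Figures quoted in the text.
cas_universe = 4_000_000                          # chemicals abstracted by CAS since 1965
seen_more_than_once = cas_universe * 12 // 100    # "about 12%"
produced_commercially = cas_universe * 3 // 100   # "no more than 3%"
projected_udb = (175_000, 200_000)                # projected size of the UDB

print(seen_more_than_once)    # 480000
print(produced_commercially)  # 120000
# Projected UDB as a share of the CAS universe, in percent.
print([100 * n / cas_universe for n in projected_udb])  # [4.375, 5.0]
```

On these figures the projected UDB is of the same order as the roughly 120,000 substances estimated to be produced in more than research quantities, and amounts to about 5% of the CAS universe.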





US Coastguard Chemical Properties File.
EPA IERL Non-Criteria Pollutant Emissions.
EPA, Section 111A of the Clean Air Act.
EPA, Office of Air Quality, Permissible Standards, Criteria Pollutants.
EPA, Office of Water Supply, File of Drinking Water Pollutants.
EPA, Pollutant Strategies Branch, Selected Organic Air Pollutants
EPA, Effluent Guidelines Consent Decree List.
EPA, Section 112 of the Clean Air Act.
EPA, ORD, Gulf Breeze, List of Chemicals.
EPA, OTS Status Asses.
EPA, Standing Air Monitoring Work Group List of Non-Criteria Pollutants.
EPA, ORD-OHEE Laboratory Chemicals.
EPA, List of Hazardous Pesticides.
CITT, List of Candidates.
EPA, List of Potentially Hazardous Chemicals from Coal and Oil.
California OSHA List of Chemical Contaminants.
WHO, Food and Agriculture Organization, List of Pesticides.
EPA, IERL, Organic Chemicals in Air.
NCI, Public List of Known Carcinogens.
NCTR, Potential Industrial Carcinogens and Mutagen
EPA, IERL, List of Environmental Carcinogens.
EPA, OPP, Pesticide Literature Searches.
NIEHS, Laboratory Chemicals.
Toxic and Hazardous Industrial Chemicals Safety Manual. International Technical Information Institute, Tokyo.
List of Teratogenic Chemicals. Medical Information Center, Karolinska Institute, Stockholm.
EPA, Mutagenicity Studies.
EPA, TSCA Section 8e, List of Chemicals.

Figure 3. New files being added to the NIH/EPA CIS UDB in Spring, 1978

Structure and Nomenclature Search System (SANSS)

The Structure and Nomenclature Search System (SANSS), the heart of the CIS, is based upon the work of Feldmann, who developed the original search algorithms a number of years ago (5). Over the last few years, a nomenclature search program, an identity search program and a search program based on the Edgewood CIDS structure keys (6) have been added, and the system has been considerably refined. The SANSS and its data base, consisting of connection tables from CAS and chemical names, have absorbed the bulk of the CIS budget.

Currently, the SANSS can be used in a number of ways. The more important methods are:

* Nomenclature Search (NPROBE)

* Ring Search (RPROBE)

* Fragment Search (FPROBE)

* CIDS Code Search (SPROBE)

* Molecular Weight Search (MW)

* Molecular Formula Search (MF)

* Substructure Search (SUBSS)

* Full Structure Search (IDENT)

In addition to these searching programs, there are a number of retrieval and display options available in the system. These include:

* Display of Chemical Structure

* Display of CAS Collective Index names

* Display of synonyms, common names and trade names

* Display of molecular formulas

* Display of files containing a substance

* Retrieval based upon CAS Registry Number

The following sections will be devoted to explaining the various SANSS modules and giving examples of how they can be used. At the end of the chapter, the interfacing of the SANSS with the NIOSH RTECS data base of acute toxicity data (7) will be described, as an example of the direction that CIS development is taking. Since there is considerable interest on the part of the chemical industry in the implementation of TSCA, access to the bulk of the public data that EPA will be using in its work for administering TSCA should be of value. At present, development of the SANSS is being directed towards the immediate needs of EPA's Office of Toxic Substances (OTS), so that the foundation that has been built for the SANSS can be used most effectively for the implementation of TSCA.

Name - Nomenclature Search (NPROBE)

The name search, NPROBE, has been implemented as a result of requests expressed by both the SANSS user community and the CEQ-TSCA MITRE study proposal (8) for the development of a Chemical Structure and Nomenclature System, which we have called the Structure and Nomenclature Search System. The software used is similar to that used in the CHEMLINE system at the National Library of Medicine (NLM) and allows for complete or partial (fragment) name search. There is an average of slightly over 3 names per chemical in the CIS UDB, as opposed to slightly more than 2 names per chemical in CHEMLINE (9). The CHEMLINE file, which links primarily to the TOXLINE literature references, is made up mostly of research chemicals, and thus is not likely to have the multiple synonyms that are associated with commercial chemicals. In the CIS UDB, which is composed primarily of files from regulatory, and hence commercial, sources, there are the expected additional names associated with materials in commerce.

To conduct a nomenclature search, the user simply enters a chemical name or name fragment, as shown in Figure 4. The example shown in Figure 4 is of a search for any substance in the UDB whose name contains the fragment "DDT." From Figure 4 it can be seen that there are 12 such substances in the UDB, of which the first, p,p' DDT, is shown in the Figure. Also shown in this figure are all the files of the UDB which contain information on p,p' DDT, with the local file identifier numbers listed so that one may go directly to the particular file and get the information that it contains regarding p,p' DDT. In Figure 5, a name search for the name fragment "LSD" was performed on the entire UDB and five examples were found. The first of these five is shown in Figure 5, with the names of the files that have information about LSD. Not surprisingly, the files include the NIMH List of Psychotropic Drugs, the Merck Index and the NIOSH acute toxicity data base, as well as the NIH/EPA Mass Spectral Data Base and the TSCA Candidate List. There is little doubt that the inclusion on the TSCA Candidate or "Strawman" list will be changed once the final TSCA inventory is published, since under present law, LSD is an illegal chemical substance. The name search is a useful technique, but it requires a large list of synonyms, a correct spelling, and a knowledge of how chemical names are broken down. For example, in searching for a cyclohexanedione, if the name of the substance in the file is written as 2, 5-dione, a search for "dione" will not find the chemical.
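The principle of a name-fragment search can be illustrated with a few lines of Python. This is a minimal sketch, not the NPROBE software: it uses plain case-insensitive substring matching, the registry-number placeholders are invented, and the name list is a small illustrative sample rather than an extract from the UDB.

```python
# Registry-number placeholder -> names and synonyms on file.
# The records are illustrative; they are not copied from the UDB.
NAMES = {
    "RN-0001": ["p,p' DDT", "Dichlorodiphenyltrichloroethane"],
    "RN-0002": ["o,p' DDT"],
    "RN-0003": ["LSD", "Lysergide"],
}

def nprobe(fragment):
    """Return registry numbers with any name containing the fragment."""
    needle = fragment.upper()
    return sorted(rn for rn, names in NAMES.items()
                  if any(needle in name.upper() for name in names))

print(nprobe("DDT"))
print(nprobe("lysergide"))
```

Note that plain substring matching is more permissive than a system that first breaks names into fragments; the sketch does not attempt to reproduce the name-breaking rules responsible for the "dione" limitation mentioned above.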

Functional Group - CIDS Key Search (SPROBE)

The best way to search for functional groups or structure features in the CIS SANSS is to use the Chemical Information Data System (CIDS) keys, developed by Edgewood Arsenal. The CIDS keys, a few of which are shown in Figure 6, are the basis of a rapid and efficient way to search the CIS UDB for substances containing a particular functional group or structure feature. Many of the CIDS keys are quite specific in nature, as can be seen in Figure 6. Others, shown towards the bottom of Figure 6, are quite generic in nature. For example, the CIDS key FG25 refers to the presence of a nitrile or cyanide group in the molecule.

An example of a CIDS key search is given in Figure 7, where a search is shown for all cyclohexyl (SCN49) morpholine (SCN35) compounds in the NIOSH RTECS data base of acute toxicity. There are only two such compounds in the data base, and the first of these is printed out in the figure, along with its local NIOSH RTECS identifier numbers.
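A key search of this kind amounts to asking which compounds carry every one of a set of precomputed feature codes. The sketch below illustrates this in Python under stated assumptions: the key codes SCN49, SCN35 and FG25 are those mentioned in the text, while the compounds, their placeholder identifiers and their key assignments are hypothetical.

```python
# Compound -> set of structure-feature keys. The key codes SCN49
# (cyclohexyl), SCN35 (morpholine) and FG25 (nitrile) come from the text;
# the compounds and their key assignments are hypothetical.
KEYS = {
    "RN-0001": {"SCN49", "SCN35"},
    "RN-0002": {"SCN49"},
    "RN-0003": {"SCN35", "FG25"},
    "RN-0004": {"SCN49", "SCN35", "FG25"},
}

def sprobe(*required):
    """Return compounds that carry every one of the required keys."""
    wanted = set(required)
    return sorted(rn for rn, keys in KEYS.items() if wanted <= keys)

print(sprobe("SCN49", "SCN35"))
print(sprobe("FG25"))
```

Because the keys are assigned once, when a compound is registered, a search is a set comparison rather than a structure comparison, which is why this kind of screen is fast.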



















Figure 4. NPROBE name search for name fragment "DDT"



Figure 5. NPROBE name search for LSD

Figure 6. Sample CIDS key codes

Figure 7. CIDS key search for cyclohexyl morpholine compounds

Molecular Weight (MW) and Formula (MF) Search

In addition to searching for a particular functional group using the CIDS keys as shown above, it is possible to search for a compound, or a group of compounds, using molecular weight. The molecular weight search, shown in Figure 8, allows for either a specific molecular weight, or, as is indicated in the figure, a range of molecular weights. In the particular example shown in Figure 8, the Merck Index is being searched for all occurrences of compounds with a molecular weight between 368 and 380. There are 167 such substances, as can be seen in the top part of Figure 8. This is too large a number, and so it was decided to narrow the search using a molecular formula search. In this case what was really sought were all compounds which have two oxygen atoms and a molecular weight between 368 and 380. In Figure 8 a search for this partial formula (O2) is shown, and this is followed by a Boolean AND logic operation (INTERsect) between the file of 167 compounds with the correct molecular weight range and the file of 1484 having the correct partial formula. The result of this AND operation is a file containing the 16 compounds in the Merck Index which have a molecular weight between 368 and 380 as well as exactly two oxygen atoms in the molecule. At the bottom of Figure 8, the first of the 16 answers is printed out. This compound, with a molecular formula of C21.H23.Cl.F.N.O2 and a molecular weight of 375, is Haloperidol, which is a drug used as a sedative and tranquilizer.

In the event that there is no interest in chlorinated compounds, even though they may meet the molecular weight and molecular formula criteria, a further molecular formula search may be conducted, as shown in Figure 9, for compounds with 1-4 chlorine atoms. From Figure 9, it can be seen that there are 986 compounds with 1-4 chlorine atoms in the Merck Index file. Since the requirement was for compounds that did not contain this halogen atom, a Boolean NOT operation between the 986 chlorine-containing compounds and the 16 compounds previously found is performed, as seen in the center of Figure 9. This results in the removal of three of the sixteen substances, and of the remaining thirteen, the first one, Androsta-3,5-dien-17-ol, 3-(cyclopentyloxy)-17-methyl-, (17.beta.), is printed out and shown at the bottom of Figure 9. This, of course, like the other twelve in the file, does not contain the chlorine that was present in three of the answers to the first search shown in Figure 8. The ability to interact and impose various limitations and filters on searching is a very powerful capability of the SANSS.
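The combination of a molecular weight range, a partial molecular formula and Boolean AND/NOT logic maps naturally onto set operations. The following Python sketch illustrates the sequence on a miniature file; it is not the SANSS code. Only the haloperidol formula and molecular weight are taken from the text, while the other entries, and the function names, are invented for the example.

```python
import re

# Hypothetical miniature file: name -> (molecular formula, molecular weight).
# Only the haloperidol entry is taken from the text; the rest is invented.
FILE = {
    "haloperidol": ("C21H23ClFNO2", 375),
    "compound_b":  ("C26H40O2", 384),
    "compound_c":  ("C25H38O2", 370),
    "compound_d":  ("C20H21ClO4", 376),
}

def atom_counts(formula):
    """Parse a simple formula such as 'C21H23ClFNO2' into element counts."""
    counts = {}
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[elem] = counts.get(elem, 0) + int(num or 1)
    return counts

def mw_search(low, high):
    """Entries whose molecular weight lies in the closed range."""
    return {n for n, (_, mw) in FILE.items() if low <= mw <= high}

def mf_search(elem, low, high=None):
    """Entries containing between low and high atoms of one element."""
    high = low if high is None else high
    return {n for n, (f, _) in FILE.items()
            if low <= atom_counts(f).get(elem, 0) <= high}

in_range = mw_search(368, 380)
two_oxygens = mf_search("O", 2)
hits = in_range & two_oxygens          # Boolean AND (INTERsect)
chlorinated = mf_search("Cl", 1, 4)
final = hits - chlorinated             # Boolean NOT
print(sorted(hits), sorted(final))
```

Each search yields a saved list, and the Boolean steps operate only on those lists, so successive filters cost little once the initial searches have been run.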

























Figure 8. Molecular-weight range search

Figure 9. Sample of combination searches of MF, MW with NOT logic



Nucleus - Ring Search (RPROBE)

One of the features of the CIS SANSS that has made the system useful is the structure of the file with respect to ring systems. The SANSS has a hierarchical file structure that allows for rapid and inexpensive searching for specific rings or ring systems. In Figure 10, a list of some of the commands used to generate structures is given. To show how the SANSS works and how one can use the various query modules, the remainder of the chapter will be devoted to searching through the NIOSH RTECS data base for chemicals having an aromatic ring substituted on ortho carbons with chlorine and bromine, respectively. The first thing that must be done in order to perform such a search is to build the "query" structure that is to be sought. This is done with the first few commands shown in Figure 11. The query structure in Figure 11 is a chloro bromo (ortho) substituted benzene ring, but the ring probe search will be conducted for any ortho disubstituted aromatic ring, since it does not take into account the nature of the substituents. Also, since other substituents on the benzene ring will be permitted, it is necessary to reset the substituent search level from "EXACT" (only two substituents, and these must be ortho) to "EMBED" (there must be two ortho substituents at a minimum). The command to do this is EXIM, which is short for EXACT/EMBED switch. The search shown in Figure 11 reveals that there are 2715 compounds in the NIOSH RTECS file that contain at least this ring pattern. To filter such potentially broad responses further, one can use CIDS key searches and other such constraints, as shown below.
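The difference between the EXACT and EMBED substituent levels can be made concrete with a small Python sketch. It is an illustration only and makes no claim about how RPROBE is implemented: a six-membered ring is reduced to the set of its substituted positions (numbered 0-5, our convention), and the nature of the substituents is ignored, as in the ring probe.

```python
def placements(positions, ring_size=6):
    """All rotations and reflections of a set of substituted positions."""
    found = set()
    for shift in range(ring_size):
        found.add(frozenset((p + shift) % ring_size for p in positions))
        found.add(frozenset((shift - p) % ring_size for p in positions))
    return found

def ring_probe(query, target, level="EMBED"):
    """EXACT: target has exactly the query pattern.
    EMBED: target has at least the query pattern."""
    target = frozenset(target)
    for placed in placements(query):
        if level == "EXACT" and placed == target:
            return True
        if level == "EMBED" and placed <= target:
            return True
    return False

ORTHO = {0, 1}                                  # two adjacent ring positions
print(ring_probe(ORTHO, {2, 3}, "EXACT"))       # True
print(ring_probe(ORTHO, {0, 1, 3}, "EXACT"))    # False
print(ring_probe(ORTHO, {0, 1, 3}, "EMBED"))    # True
print(ring_probe(ORTHO, {0, 2}, "EMBED"))       # False
```

A trisubstituted ring with two adjacent substituents is rejected at the EXACT level and accepted at the EMBED level, while a purely meta-disubstituted ring fails both.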

Fragment Search (FPROBE)

One feature necessary to any structure search system is the ability to search for atom-centered fragments. In a fragment search the user must specify an atom and its neighbors. The exact (or generic) nature of the bonds between this central atom and each of its neighbors is then entered and a search is conducted for all occurrences of such a fragment. If a query structure has already been generated, as was done in Figure 11, that structure can be used by the SANSS program to generate and search for fragments. There are usually a number of atoms in a query structure that can be considered as central to a fragment. Hence, a request for a fragment probe of the substructure shown in Figure 11 would lead to searches for six fragments, four of which would be the same (i.e., the atom-centered fragments about atoms 3, 4, 5 and 6 are all the same, representing a carbon atom in an aromatic ring attached to two other aromatic carbon atoms in the ring and a hydrogen). Such fragments are not very specific, and so it is best to identify the atom-centered fragment for which one wishes to search. In Figure 12, atom number 1 is selected and a search for all occurrences of a chlorine atom on an aromatic ring is performed. The result of this search is a file containing all 1618 compounds in the NIOSH RTECS file that contain this particular structure fragment.

After the fragment search is conducted for the chloro aromatic fragment, a similar search is performed on the fragment centered about atom 2, which contains a bromo substituent. This fragment probe (FPROBE) search, shown in Figure 13, results in 229 occurrences of this fragment in compounds in the NIOSH RTECS data base.
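The generation of atom-centered fragments from a query structure can be sketched in Python as follows. This is a simplified illustration, not the FPROBE algorithm. Ring atoms are numbered 1-6 as in the text; numbering the chlorine as atom 7 and the bromine as atom 8, the bond labels, and the rule for implicit hydrogens are assumptions made for the example.

```python
from collections import Counter

# Query structure of Figure 11 as a connection table. Ring atoms are
# numbered 1-6 as in the text; numbering the chlorine as atom 7 and the
# bromine as atom 8 is an assumption made for this sketch.
ELEMENT = {1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "Cl", 8: "Br"}
BONDS = [(1, 2, "ar"), (2, 3, "ar"), (3, 4, "ar"), (4, 5, "ar"),
         (5, 6, "ar"), (6, 1, "ar"), (1, 7, "single"), (2, 8, "single")]

def fragment(center):
    """Atom-centered fragment: the center element plus its sorted
    (bond type, neighbor element) pairs. A ring carbon with only two
    explicit neighbors is given one implicit hydrogen."""
    neighbors = []
    for a, b, kind in BONDS:
        if center in (a, b):
            other = b if a == center else a
            neighbors.append((kind, ELEMENT[other]))
    if ELEMENT[center] == "C" and len(neighbors) == 2:
        neighbors.append(("single", "H"))
    return ELEMENT[center], tuple(sorted(neighbors))

ring_fragments = {atom: fragment(atom) for atom in range(1, 7)}
tally = Counter(ring_fragments.values())
print(fragment(1))
print(tally.most_common(1))
```

The six ring atoms give three distinct fragments: the chlorine-bearing carbon, the bromine-bearing carbon, and the unsubstituted aromatic carbon, which occurs four times and is therefore the least selective of the three.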

Figure 10. Commands used to generate structures for searching

Figure 11. A ring-probe (RPROBE) search for a disubstituted benzene





































Figure 12. A fragment probe (FPROBE) for a chlorine atom attached to an aromatic carbon atom

Figure 13. A fragment probe (FPROBE) for a bromine atom attached to an aromatic carbon atom

Figure 14. Intersection and substructure search of files derived in Figures 11-13

Figure 15. One of seven substructure search hits

Figure 16. Example of IDENT search for a complete molecule

Figure 17. Example of NIOSH RTECS toxicity data retrieval



Substructure Search (SUBSS)

The Substructure Search option is an atom-by-atom, bond-by-bond comparison between connection tables in the data base and the connection table corresponding to the query structure. This time-consuming, sequential search is quite costly, and so the ring probe, fragment probe, and other search techniques described above are used as screens to speed up the process and reduce the cost. Following the three separate searches done in Figures 11-13, the next step is to see which compounds in the NIOSH RTECS data base contain occurrences of all three. This is done by a simple Boolean AND logic combination of the three lists of Registry Numbers generated by the searches in these Figures. The intersection of the lists, performed by the INTER command as shown in Figure 14, results in 12 compounds meeting the criteria of all three searches. However, not all of the 12 answers are necessarily what is wanted. This is because the three searches in Figures 11-13 are for "pieces" of the structure sought, but the searches do not require these pieces to be in the same juxtaposition as in the query structure. That is, the three requirements comprise a necessary, but not a sufficient, condition for an answer to the original question. To secure an exact answer as to how many (if any) of these 12 compounds meet the exact query structure, it is necessary to perform a true substructure search (SUBSS), as is shown in Figure 14. The result of the use of SUBSS shows that only 7 of the 12 "answers" from the intersection of the three searches do have the bromine and chlorine ortho to one another on the benzene ring. Of the 7 answers, one is shown in Figure 15. As it turns out from inspection of all 12 prior answers (not shown here), the other compounds retrieved are meta substituted chloro bromo aromatic compounds.
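The screen-then-verify procedure described above can be illustrated with a short Python sketch. It is not the SUBSS implementation: the atom-by-atom comparison is a plain backtracking match that ignores bond orders, the screens are reduced to one-line tests, and the four-compound file is hypothetical. It does show why a compound can pass all three screens and still fail the substructure search.

```python
def build(subs):
    """Connection table for a benzene ring (atoms 0-5) with substituents.
    subs maps a ring position to the element attached there."""
    atoms = ["C"] * 6
    adj = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
    for pos, elem in sorted(subs.items()):
        new = len(atoms)
        atoms.append(elem)
        adj[new] = {pos}
        adj[pos].add(new)
    return atoms, adj

def subss(query, target):
    """Atom-by-atom, bond-by-bond test: is query embedded in target?
    Bond orders are ignored here for brevity."""
    q_atoms, q_adj = query
    t_atoms, t_adj = target

    def extend(mapping):
        q = len(mapping)                  # next query atom to place
        if q == len(q_atoms):
            return True
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # every already-placed neighbor of q must be bonded to t
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})

# Hypothetical miniature file; names describe the substitution pattern.
FILE = {
    "ortho":             {0: "Cl", 1: "Br"},
    "ortho_plus_methyl": {0: "Cl", 1: "Br", 3: "C"},
    "meta_with_methyl":  {0: "Cl", 1: "C", 2: "Br"},
    "chloro_only":       {0: "Cl", 1: "C"},
}
QUERY = build({0: "Cl", 1: "Br"})

# Three screens: ortho-disubstituted ring, chlorine on ring, bromine on ring.
ring_hits = {n for n, s in FILE.items() if any((p + 1) % 6 in s for p in s)}
cl_hits = {n for n, s in FILE.items() if "Cl" in s.values()}
br_hits = {n for n, s in FILE.items() if "Br" in s.values()}

candidates = ring_hits & cl_hits & br_hits      # INTER of the three lists
answers = {n for n in candidates if subss(QUERY, build(FILE[n]))}
print(sorted(candidates))
print(sorted(answers))
```

The entry with chlorine and bromine meta to one another passes every screen, because a methyl group supplies the ortho disubstitution, but it is rejected by the atom-by-atom comparison. The costly comparison is run only on the candidates that survive the screens.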

Complete Structure Search (IDENT)

The final SANSS module to be described in this chapter is the search for a total or full structure, rather than a substructure. This module was designed primarily for the purpose of searching for and reporting specific chemicals as part of the TSCA inventory reporting procedures. The full structure search, called IDENT (for IDENTity), has and will continue to have specific application to TSCA activities. For example, after the final "grandfather" inventory required under section 8 of the Act is published and made available, via the CIS as well as by other means, it will be necessary for potential vendors of a chemical to determine if the chemical they wish to sell or manufacture is in the Inventory and can thus be produced and marketed without extensive pre-manufacturing testing. Use of the IDENT search will quickly reveal if the chemical is in the TSCA inventory. Of course, one can use the name search capabilities, but there is no guarantee that the name used by the manufacturer will be in the list of synonyms associated with the inventory. The structure shown in Figure 16 was generated using the standard SANSS structure generation commands, such as those listed in Figure 10. The IDENT search was then invoked and, after being told that the structure had the normal number of hydrogen atoms, consistent with normal valence, it found the structure in the CIS UDB. The structure was then printed out, with all the local file identifier information, as well as a number of synonyms, one of which is the TSCA Clerical Code Designation number for the substance.

SANSS-Data Base Interfaces

A structure or a nomenclature search is generally only a means to an end. The end is often some data associated with the structures found. In order to facilitate retrieval of such information, an interface between the CIS numeric data bases and the SANSS has been constructed. This allows for a search through the UDB followed by a data search (or retrieval) and permits one to answer such queries as:

* Do any ortho bromo-chloro aromatic compounds have a toxicity greater than 1.0 mg./kg?

In the example shown in Figure 17, the first three answers from the previous search are used to retrieve the toxicity data associated with these compounds. The automatic interface between the systems is invoked by the command TSHOW, and then the previous file of 7 CAS Registry Numbers, generated by SUBSS, is specified, with only the first three being printed out upon request.
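The interface between a structure search and a numeric data base is, in essence, a join on the registry number. The Python sketch below illustrates the idea; it is not the TSHOW implementation. The registry-number placeholders, the record layout (route, species, measure, dose in mg/kg) and all toxicity values are invented for the example and are not RTECS data.

```python
# Registry-number placeholders and toxicity values are invented for
# illustration; they are not RTECS data.
TOXICITY = {
    "RN-0001": [("oral", "rat", "LD50", 250.0)],
    "RN-0002": [("oral", "mouse", "LD50", 0.8),
                ("skin", "rabbit", "LD50", 40.0)],
    "RN-0003": [],
}

def tshow(registry_numbers, limit=None):
    """Retrieve the toxicity records for a list saved by a prior search."""
    selected = registry_numbers[:limit] if limit else registry_numbers
    return {rn: TOXICITY.get(rn, []) for rn in selected}

def with_dose_above(registry_numbers, threshold_mg_per_kg):
    """Keep substances having at least one reported dose above the threshold."""
    return [rn for rn in registry_numbers
            if any(dose > threshold_mg_per_kg
                   for *_, dose in TOXICITY.get(rn, []))]

substructure_hits = ["RN-0001", "RN-0002", "RN-0003"]
print(tshow(substructure_hits, limit=2))
print(with_dose_above(substructure_hits, 1.0))
```

The list of registry numbers saved by the structure search is the only thing passed between the two systems, which is what allows any file keyed by CAS Registry Number to be attached in the same way.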

Summary

The NIH/EPA CIS has developed to the point where complex questions can be readily answered. The ability to manipulate structure and numeric data and to establish correlations between the two should be of considerable value to the EPA in its work under the Toxic Substances Control Act, as well as to scientists in general. The value of the SANSS linked to CNMR data has been recently shown (10), and no doubt other structure-data studies will be undertaken now that the necessary groundwork has been laid.

Acknowledgments

The authors wish to thank the following for their help and cooperation in developing the CIS SANSS: R. J. Feldmann, W. Greenstreet, M. Yaguda, M. Bracken, A. Fein, G. Marquart, and J. Miller.

Literature Cited

1. Heller, S. R., Milne, G. W. A., and Feldmann, R. J., Science, (1977), 195, 253.

2. Feldmann, R. J., Milne, G. W. A., Heller, S. R., Fein, A., Miller, J. A., and Koch, B., J. Chem. Info. and Comp. Sci., (1977), 17, 157.

3. The Interagency Regulatory Liaison Group (IRLG) was established August 2, 1977, by the following four Agencies: EPA, FDA, OSHA and CPSC.

4. EPA Order #2800.2, issued May 27, 1975.

5. Feldmann, R. J., and Heller, S. R., J. Chem. Doc., (1972), 12, 48.

6. CIDS Structure Feature Key Code Manual is available from CIS Project, Chemistry Department, Brookhaven National Laboratory, Upton, Long Island, New York 11973.

7. NIOSH, Registry of Toxic Effects of Chemical Substances (RTECS), 1977. Available from the US Government Printing Office, GPO Order Number 017-033-0027101; $17.50 per copy USA: $21.88 per copy non-USA.

8. Bracken, M., Dorigan, J., Hushon, J., and Overbey, II, J., MITRE Reprint MIR-7578 to CEQ, June 1977. Two volumes entitled "Chemical Substances Information Network (CSIN)."

9. NLM Fact Sheet for the Toxicology Information Program, January 1978.

10. Milne, G. W. A., Zupan, J., Heller, S. R., and Miller, J. A., Anal. Chim. Acta, In press (1978).

Received August 29, 1978.