A COMPUTER-BASED TOXICOLOGY SEARCH SYSTEM



J. R. McGill

Fein-Marquart Associates,

7215 York Rd.,

Baltimore, Md., 21212



S. R. Heller

Environmental Protection Agency, PM-218,

Washington, D.C., 20460



G. W. A. Milne

National Institutes of Health,

Bethesda, Md., 20014



An interactive computer search system based on the National Institute for Occupational Safety and Health's Registry of Toxic Effects of Chemical Substances (NIOSH-RTECS) has been developed. This system permits the location and retrieval of specified toxicity data defined by test animal, dosage method, toxicity level, and compound identity. All available toxicity data for a given chemical substance, identified by name or structure, may be retrieved using either the Chemical Abstracts Service (CAS) Registry Number or the RTECS Accession Number for that compound. The search system runs on an international computer network and may be used, on a fee-for-service basis, by anyone interested.



INTRODUCTION



Under the Occupational Safety and Health Act of 1970 (PL 91-596), the National Institute for Occupational Safety and Health (NIOSH) was required to compile and maintain a list of toxic substances which is called the Registry of Toxic Effects of Chemical Substances (RTECS).



This Registry, which is revised annually, contains toxicity data for over 30,000 different chemical substances. The data are taken primarily from the open literature. Some 900 separate journals and other sources were examined in the preparation of the 1976 edition. The bulk of the toxicity data in the RTECS deals with short-term or acute effects of research chemicals. Thus a large number of the entries provide measures of toxicity such as the dose lethal to 50 percent of the population (LD50). Carcinogenic, teratogenic and mutagenic chronic toxicity data are provided for some substances.



The toxicities provided in the RTECS are collected non-critically in the sense that data in the scientific literature are taken as published. The major check for validity of such data is therefore in the editorial process of the journal that originally published them.



The 1976 edition of RTECS, upon which the search system described here is based, contains data for 30,105 unique chemical substances. In 1976, it was estimated by NIOSH that toxicity data exist for perhaps 100,000 chemicals. The RTECS therefore covers about 30 percent of the established body of information, although it appears that only 10-35 percent of the entries concern chemicals that are significant in commerce. This estimate is based upon the observed degree of overlap between the RTECS and the file of the International Trade Commission (2,271, or 11.4 percent of the RTECS) and between the RTECS and the Toxic Substances Control Act (TSCA) Candidate List (6,414, or 32.2 percent of the RTECS). Thus it is clear that the RTECS represents a uniquely valuable resource for those concerned with the potential effect of toxic chemicals upon the environment.



For these reasons, it was decided during 1977 to develop an on-line search system that could retrieve information from the RTECS. The system was designed as a part of the NIH-EPA Chemical Information System (CIS) (Heller et al.), a large interactive data retrieval system that is used by the international scientific community via networked computers. This paper describes the design and use of the search system that has emerged from this work.





DISCUSSION



The RTECS reports toxicity data only for whole animal studies. A total of 28 different animal species are cited in the data base, and these are given in Table 1. Either the abbreviations in Table 1 or the full names may be used in defining search criteria. If the user does not know the abbreviation of a name or is unsure whether a particular species is to be found in the RTECS, the permitted species in full and abbreviated form may be listed by the command 'HELP CODES.' A number of animal groupings (Table 1) are also allowed. With these group codes, one can, for example, retrieve all human toxicity data, which comprise all data for infants, children, men, women and otherwise unidentified human groups.



There are 24 routes of administration recognized in the data base. These are all referred to by the standard three-letter codes given in Table 2. As with the animal names, the full word or the code can be used in a search. Both the codes and the full words can be listed using the command 'HELP CODES.'



For each of the 30,105 different chemical substances covered by the RTECS, one or more actual measurements of toxicity are given. There are six types of measurement reported in the RTECS. Three of these are used only for the inhalation route of exposure: the lower limit of lethal concentration (LCLO), the concentration lethal to 50 percent of the exposed population (LC50), and the lower limit of toxic concentration (TCLO). For other administration routes, there may be available the lower limit of lethal dosage (LDLO), the dosage lethal to 50 percent of the treated population (LD50), or the lower limit of toxic dosage (TDLO). In defining a search, the user must select one of these six measures; but, if a general retrieval is required, the criterion 'ALL' may be used and will permit retrieval of any entry, irrespective of the type of toxicity measurement.
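
The pairing of measurement types with administration routes can be sketched as a small validation table. This is an illustration only, not the CIS program: the function names are hypothetical, and 'IHL' is assumed here as the three-letter code for the inhalation route (the actual codes are those of Table 2).

```python
# Illustrative sketch; none of these names come from the CIS software.
CONCENTRATION_MEASURES = {"LCLO", "LC50", "TCLO"}  # inhalation route only
DOSAGE_MEASURES = {"LDLO", "LD50", "TDLO"}         # all other routes
INHALATION = "IHL"  # assumed route code; Table 2 holds the real codes


def allowed_measures(route: str) -> set:
    """Return the toxicity measures that can be reported for a route."""
    if route.upper() == INHALATION:
        return set(CONCENTRATION_MEASURES)
    return set(DOSAGE_MEASURES)


def is_valid_criterion(route: str, measure: str) -> bool:
    """'ALL' matches any measure; otherwise the measure must suit the route."""
    measure = measure.upper()
    return measure == "ALL" or measure in allowed_measures(route)
```

Under these assumptions, `is_valid_criterion("ORL", "LD50")` is accepted, while an oral LC50 is rejected because concentration measures belong to the inhalation route.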



In cases where the toxicity measure is the lower limit of toxic concentration or dosage (TCLO or TDLO), the data base allows specification of any of the four 'special toxic effects': carcinogenesis, mutagenesis, neoplastigenesis or teratogenesis. In a search for TCLO or TDLO data, any or all of these special toxic effects may be specified, and only entries with data of this sort will be retrieved.



Finally, the quantitative value of the toxic effect may be specified at the discretion of the user. This quantitative value may be ignored, in which case all entries meeting the previously defined criteria will be retrieved; alternatively, should the user invoke this option, an opportunity is provided to specify the upper and lower limits of the measured toxicity.

The searching program queries the user about each of these five criteria in turn and then carries out the searches to provide all citations that match the defined criteria. The searches are carried out using inverted files with pointers, a data management technique that is used in many CIS components and is described in detail elsewhere (Heller).
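
The five criteria can be pictured as a single query record. The sketch below is a hypothetical model, not the program's actual data structure: the class name `ToxQuery`, its field names, and the effect codes other than 'CAR' are assumptions, as is the choice to treat the lower limit as inclusive and the upper limit as exclusive.

```python
from dataclasses import dataclass
from typing import Optional

# Only 'CAR' is named in the text; the other three codes are assumed.
SPECIAL_EFFECTS = {"CAR", "MUT", "NEO", "TER"}


@dataclass(frozen=True)
class ToxQuery:
    animal: str                       # e.g. "PGN" (pigeon)
    route: str                        # e.g. "ORL" (oral)
    measure: str                      # one of the six measures, or "ALL"
    effects: frozenset = frozenset()  # special toxic effects (TCLO/TDLO only)
    lower: Optional[float] = None     # optional lower limit of toxicity
    upper: Optional[float] = None     # optional upper limit of toxicity

    def __post_init__(self):
        # Special toxic effects may be specified only for TCLO or TDLO.
        if self.effects and self.measure not in {"TCLO", "TDLO"}:
            raise ValueError("special toxic effects apply only to TCLO or TDLO")
        unknown = set(self.effects) - SPECIAL_EFFECTS
        if unknown:
            raise ValueError(f"unknown special effect(s): {sorted(unknown)}")
        if (self.lower is not None and self.upper is not None
                and self.lower > self.upper):
            raise ValueError("lower limit exceeds upper limit")

    def value_in_range(self, value: float) -> bool:
        """Check a measured toxicity against the optional limits."""
        above = self.lower is None or value >= self.lower
        below = self.upper is None or value < self.upper
        return above and below
```

A query for oral LD50 values in pigeons below 20 mg/kg would then be written as `ToxQuery("PGN", "ORL", "LD50", upper=20.0)`.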



The inverted file is a re-ordered version of the data base and, in the case of the RTECS, is constructed in five hierarchical levels. A search through this hierarchical file for all compounds having an oral LD50 in pigeons of less than 20 mg/kg would proceed as follows: first, the animal pointer file is consulted and the entry "PGN," corresponding to pigeons, is found. Associated with it is a pointer to the place in the dosage information file where those dosages pertinent to measurements of pigeon toxicity begin. The search system then begins at that place in the dosage information file and searches down it for the entry "ORL." Here is found a pointer to the correct address in the measurement information file to find the data on oral pigeon toxicities. This process continues down to the lowest-level list, which contains the CAS Registry Numbers of the chemicals that fit the search criteria entered by the user. The major advantage of this approach is that it is very fast, requiring less than 3 cpu seconds on the DEC System 10 computer for which it was written. The inverted file can, thus, be made the basis of an interactive program such as this because the response time is so short that the user can in fact carry on a 'conversation' in an effort to arrive at the required information.
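
The level-by-level lookup can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DEC System 10 implementation: the nested dictionary `INVERTED` stands in for the pointer files, the function `search` is hypothetical, only the animal, dosage and measurement levels are shown explicitly, and the identifiers and values are placeholders rather than RTECS data.

```python
from bisect import bisect_left

# Hypothetical stand-in for the hierarchical inverted file.
# Levels: animal -> dosage route -> measurement -> list of
# (value in mg/kg, compound identifier), kept sorted by value.
# All identifiers and values below are placeholders, not RTECS data.
INVERTED = {
    "PGN": {
        "ORL": {
            "LD50": [(2.0, "ID-A"), (15.0, "ID-B"), (40.0, "ID-C")],
        },
    },
    "MUS": {
        "ORL": {
            "LD50": [(10.0, "ID-A")],
        },
    },
}


def search(animal, route, measure, below):
    """Follow the pointers level by level, then cut the lowest-level list."""
    routes = INVERTED.get(animal, {})        # animal pointer file
    measures = routes.get(route, {})         # dosage information file
    entries = measures.get(measure, [])      # measurement information file
    # The list is sorted by value, so every entry before this index
    # has a value strictly less than `below`.
    cut = bisect_left(entries, (below,))
    return [ident for _, ident in entries[:cut]]
```

Each level narrows the search to one branch, so no part of the file that concerns other animals or routes is ever read; this is the property that makes the response time short enough for interactive use.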



When a search is complete, the user is notified that the search resulted in the retrieval of a certain number of compounds, which are currently stored in a specific temporary file. If the user wishes, this file can be examined by the command 'TSHOW.' The program will ask how many should be listed. Any number, up to the number of entries in the temporary file, can be given by the user, and the corresponding number of entries will be typed out. The user then has the choice of leaving the 'TSHOW' program or continuing the listing.



In the absence of any other instructions, the TSHOW program will list each of the retrieved entries in its entirety. This may be desired, but often the user is interested only in toxicity data specific to the search. If, for example, the search was for toxicity in mice, human toxicity data for the same compounds may be irrelevant. To avoid listing such irrelevant data, the command 'MASK' may be invoked. This simply screens out, from the data to be listed, all information not strictly concerned with the search criteria. This is a 'toggle' or on-off operator, and can subsequently be inactivated by the command 'UNMASK.'
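
The effect of 'MASK' amounts to a filter applied to each entry before it is listed. The following sketch assumes a hypothetical record type `ToxLine` and function `lines_to_list`; the sample values are invented for illustration and are not RTECS data.

```python
from typing import NamedTuple, Optional


class ToxLine(NamedTuple):
    route: str     # dosage route code, e.g. "ORL"
    animal: str    # species code, e.g. "PGN"
    measure: str   # type of toxicity measurement, e.g. "LD50"
    value: float   # toxicity level in mg/kg


def lines_to_list(entry, animal, route, measure,
                  below: Optional[float] = None, mask: bool = False):
    """Return every line when mask is off, else only lines matching the query."""
    if not mask:
        return list(entry)
    return [
        line for line in entry
        if line.animal == animal
        and line.route == route
        and line.measure == measure
        and (below is None or line.value < below)
    ]


# Placeholder entry; the values are invented, not taken from the RTECS.
entry = [
    ToxLine("ORL", "PGN", "LD50", 12.0),
    ToxLine("ORL", "RAT", "LD50", 90.0),
]
```

With the mask on, a pigeon oral LD50 query below 20 mg/kg lists only the first line of this entry; with the mask off, both lines are listed.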



Every chemical substance in the RTECS is identified by a name and a sequence number, which consists of two letters followed by seven digits. The name used is based upon the Chemical Abstracts Service (CAS) systematic nomenclature rules. Synonyms are provided for many of the entries. As a first step in the merger of the RTECS into the NIH-EPA Chemical Information System (CIS), the entire list of compounds in the RTECS was submitted to CAS for registration. This process, described in detail by Heller et al., yields the CAS Registry Number for each identifiable substance, and is required for all EPA files (EPA Regulation 2800.2), including those of the NIH-EPA CIS. Once the Registry Number for each compound is obtained, it is appended to the RTECS record for that compound and is also used to retrieve from the CAS Registry (Dittmar et al.) all known synonyms for the compound, its molecular weight and molecular formula, and its connection table (a computer-readable version of its structural formula).
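
The stated form of the RTECS sequence number, two letters followed by seven digits, lends itself to a simple format check. The sketch below is illustrative only; the function name is hypothetical, and the check tests the form of a number, not whether it exists in the Registry.

```python
import re

# Two letters followed by seven digits, as described in the text.
_RTECS_NUMBER = re.compile(r"[A-Za-z]{2}[0-9]{7}")


def looks_like_rtecs_number(text: str) -> bool:
    """True if the text has the form of an RTECS sequence number."""
    return _RTECS_NUMBER.fullmatch(text.strip()) is not None
```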



The connection table of the compound is the basis of the Structure and Nomenclature Search System (Feldmann et al.), a central component of the CIS. The Structure and Nomenclature Search System permits one to search through a variety of files of chemicals, including the list of chemicals in the RTECS, for substances with a given name, structure or substructure. Upon completion of a search, the Registry Number(s) of the substance(s) retrieved are provided to the searcher, who may use these five- to nine-digit universal identifiers in a variety of ways. Using the DIALOG (Lockheed) or TOXLINE (NLM) systems, for example, they can be used to retrieve citations to papers dealing with the compounds in question. In the RTECS search system, the CAS Registry Number represents the easiest way to retrieve the complete RTECS entry for a compound of interest. The same entry can be retrieved using the RTECS accession number, but these numbers are not as readily available. The CAS Registry Number, on the other hand, is a widely used identifier, which can be learned from a variety of sources, such as the CIS Structure and Nomenclature Search System. When either the CAS Registry Number or the NIOSH accession number is provided to the search system, the entire entry corresponding to the given number is provided to the user.



The command 'TSHOW' permits the listing of a complete RTECS entry given the appropriate CAS Registry Number or the number of a temporary file of retrievals from a search. If TSHOW is provided with a one- or two-digit number, the program recognizes that this cannot be a CAS Registry Number and that it must be the number of a temporary file. Accordingly, it takes the appropriate temporary file of Registry Numbers and retrieves the corresponding RTECS entries. If a name search or a structure or substructure search has been carried out using the Structure and Nomenclature Search System, and it is necessary to retrieve the RTECS entry for each of the compounds of interest, this may be done, from within the Structure and Nomenclature Search System, with the command TSHOW. Upon receipt of the TSHOW command, the Structure and Nomenclature Search System controller takes the specified temporary file of CAS Registry Numbers and 'carries' it to the RTECS Search System, conducts the retrievals, and returns the results to the user, who is still in the Structure and Nomenclature Search System. A list of CAS Registry Numbers may be entered from the user's terminal with the command INCAS, and a list of NIOSH RTECS numbers can be used for the same purpose with the command NIOSH.
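
The rule by which TSHOW distinguishes a temporary file number from a CAS Registry Number can be sketched as follows. The function name and return strings are hypothetical, and the acceptance of hyphenated input and of five to nine digits for a Registry Number are assumptions made for this illustration.

```python
def classify_tshow_argument(arg: str) -> str:
    """Decide what a TSHOW argument refers to, by its length alone."""
    arg = arg.strip()
    # A one- or two-digit number cannot be a CAS Registry Number,
    # so it is taken to be the number of a temporary file.
    if arg.isascii() and arg.isdigit() and len(arg) <= 2:
        return "temporary file"
    # Assumed here: Registry Numbers have five to nine digits and may
    # be typed with or without hyphens.
    digits = arg.replace("-", "")
    if digits.isascii() and digits.isdigit() and 5 <= len(digits) <= 9:
        return "CAS Registry Number"
    raise ValueError(f"cannot interpret {arg!r}")
```

For example, `classify_tshow_argument("4")` is treated as temporary file 4, whereas `classify_tshow_argument("106-46-7")`, the Registry Number of p-dichlorobenzene, is treated as a compound identifier.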



This large-scale retrieval of toxicity data for many compounds of specific structural types is the basis of current efforts within NIH and EPA to define structure-toxicity relationships and so to devise methods for meeting the need, common to both agencies, to estimate possible toxicities of compounds for which little toxicological study has been done.



RESULTS



A search through the RTECS for a specific structure is shown in Figure 1. First the Structure and Nomenclature Search System is used, but rather than search through all 32 of the available data bases, the user may limit the search to data base 32, the RTECS. This done, the required structure of p-dichlorobenzene is generated (Feldmann et al.) using the commands RING, ALTBD 1 2, ABRAN 1 AT 1 1 AT 4, and SATOM 7 8. The search for this structure is requested with the command IDENT, and the entry containing that compound is retrieved and stored in temporary file 1. When this file is examined with the SSHOW command, the CAS Registry Number and RTECS number of p-dichlorobenzene are given, together with the molecular and structural formulas and a number of available synonyms for the compound. Transfer to the RTECS Search System, by use of the command TSHOW (an implicit transfer from one component to another of the CIS by the use of a command unique to the latter) results in the immediate retrieval of all the available toxicity information on the compound. Each line of the output in Figure 1 gives the type of dosage, followed by the animal species, the type of toxicity measurement, the observed toxicity level, the specific toxic effects observed, if any, and finally, the literature citation. All journals are referred to in the form of the standard ASTM Coden (CODEN). Given the Coden, the full name of the journal can be obtained from standard sources or by using the command CODEN, which accepts a Coden and retrieves the correct name of the journal.



A different approach to the retrieval of toxicity information is shown in Figure 2. Here, the object of the search was any toxicological data on a compound known as ethyl ziram. Using the option NPROBE, the entry corresponding to the full name 'ETHYL ZIRAM' was retrieved and stored in temporary file 1. Examination of this file, using the SSHOW option, gave the CAS Registry Number of the compound and its NIOSH RTECS number, together with the appropriate molecular and structural formulas and various synonyms for the compound. Use of the TSHOW command then causes the program to turn to RTECS and retrieve the toxicity data for the compound, as before.



Searching in the reverse direction, i.e. for compounds exhibiting a specific toxicity pattern, is slightly more complicated, because more parameters are involved in the definition of the toxicity pattern. In conducting a search for compounds having a specific toxicity pattern, it is useful to define each of the five available parameters as closely as possible, so as to restrict the results of a search to manageable proportions.



An example of a search through the RTECS is shown in Figure 3. The option 'SEARCH' permits searching for any combination of the five parameters defined above. First, the user is asked to define the type of animal. Twenty-eight animal types and seven animal groupings, shown in Table 1, are reported in RTECS and any one of these may be specified at this point. The animal in question may be identified using the normal full word (mouse, monkey) or the RTECS abbreviation (mus, mky). In the example in Figure 3, the animal type was defined as pigeon (PGN), and the program goes on to enquire as to the dosage method. Once the oral route (ORL) is defined, the type of toxicity measurement is requested, and LD50 is entered. Finally, the user is asked if the limits of toxicity are to be defined. The response to this query is positive, the limits are requested, and given as <20 mg/kg.



The search for all entries meeting these criteria (in the case shown in Figure 3, all compounds showing an oral LD50 below 20 mg/kg in pigeons) will then be carried out and the number of entries found will be reported. In this case, 19 such entries are found and these are stored in a temporary file pending the user's decision as to how to handle these data. Since this number of entries is quite manageable, it is reasonable to print at least a few of them, and so the first four are listed using the TSHOW option. First, however, the option 'MASK' is invoked. This ensures that, when the four entries are listed, only the parts of those entries that match the query will be given. Thus, there may be data associated with one of the 19 compounds and concerning rat oral LD50 values, but interest was only expressed in pigeon oral LD50 values below 20 mg/kg and only the information that meets these criteria will be printed. If the other toxicity measurements are required, then the option 'UNMASK' will reverse the effect of 'MASK' and all the data will be given, as would be the case if 'MASK' had never been invoked.



A survey of the RTECS for known carcinogens is given in Figure 4. The search is to be conducted irrespective of animal type but only TDLO measurements are requested. This causes the program to inquire as to which special toxic effects are of interest, and at this point, the user can respond with 'CAR' for 'CARCINOGENESIS.' Finally, the search is limited to compounds with a TDLO value below 100 mg/kg and it is then carried out. A total of 25 compounds satisfy all of these criteria and these are stored in file 4. The MASK option is invoked and then the first eight entries in file 4, which is ordered by CAS Registry Number, are listed. If one then wished to display the structure of each, the command SSHOW 4 would transfer the user to the Structure and Nomenclature Search System, where the structures of the compounds in file 4 can be displayed.



The entire CIS, including the RTECS Search System described here, is developed using IBM 370/168 and DEC System 10 computers at NIH. Once a component of the CIS is developed and tested, it is installed upon a time-shared, networked computer in the private sector, where it becomes available upon a fee-for-service basis to the international scientific community. The RTECS is now available for use in this manner. Because it has been installed upon a networked computer, access to the system can be achieved from most cities in North America and Europe without use of long distance telephone calls; a local call is usually sufficient. The cost of using the entire CIS, which includes the RTECS Search System, is $36 per connect hour. Most searches require less than 5 minutes of real, or connect, time and the cost per search, therefore, is less than $3.00. The system is available 7 days per week on a 24-hour basis, and it is expected that the data base will be updated at least twice a year (1).
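
The cost figure follows directly from the connect-hour rate. As a worked check (the function name is hypothetical, and only connect time is assumed to be charged):

```python
RATE_PER_CONNECT_HOUR = 36.00  # dollars, as quoted in the text


def search_cost(connect_minutes: float) -> float:
    """Cost in dollars of a session lasting the given connect time."""
    return RATE_PER_CONNECT_HOUR * connect_minutes / 60.0
```

A 5-minute session costs 36 x 5/60 = $3.00, so any search finished in less than 5 minutes costs less than that amount.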



SUMMARY



The RTECS Search System is the first operational module of a larger chemical structure/toxicology search system. Current plans call for this system to contain, in addition to the RTECS, an Aquatic Toxicity Data Base, a single cell mutagenicity data base, the data bases of the Environmental Mutagen Information Center and the Environmental Teratogen Information Center as well as the Clinical Toxicology of Commercial Products (Gleason et al.). All these toxicological data bases will, as part of the NIH-EPA Chemical Information System, be structurally searchable and this should provide a very valuable tool for the study of the complex relationships between specific types of toxicity and chemical structure.



REFERENCES



CODEN For Periodical Titles. American Society for Testing and Materials, 1916 Race St., Philadelphia, Pa., 19103. (1970).



Dittmar, P. D., Stobaugh, R. E., and Watson, C. E. Jr.: J. Chem. Inf. Comp. Sci., 16:111, (1976).



Feldmann, R. J., Milne, G. W. A., Heller, S. R., Fein, A., Miller, J. A., and Koch, B.: J. Chem. Inf. Comp. Sci., 17:157, (1977).



Gleason, M. N., Gosselin, R. E., Hodge, H. C., and Smith, R. P.: Acute Poisoning, Home and Farm. In Clinical Toxicology of Commercial Products, Fourth Edition. Baltimore, Williams and Wilkins, 1976.



Heller, S. R.: Anal. Chem., 44:1951, (1972).



Heller, S. R., Milne, G. W. A., and Feldmann, R. J.: J. Chem. Inf. Comp. Sci., 16:232, (1976).



Heller, S. R., Milne, G. W. A., and Feldmann, R. J.: Science, 195:253, (1977).



Lockheed Information Systems, DIALOG System, 3251 Hanover St., Palo Alto, Calif., 94304.



NIOSH Registry of Toxic Effects of Chemical Substances. Govt. Printing Office. June 1976. Page v.



NLM, National Library of Medicine, TOXLINE System. NIH, Bethesda, Md., 20014.





1) Those seeking access to the system should contact the CIS Operations Manager, H. J. Bernstein, Department of Chemistry, Brookhaven National Laboratory, Upton, N.Y., 11973. Telephone (516) 345-4379.





Received May 18, 1978

Accepted June 2, 1978