The NIH-EPA Chemical Information System
G. W. A. MILNE
National Institutes of Health, Bethesda, MD 20014
S. R. HELLER
Environmental Protection Agency, Washington, DC 20460
The quantity of data associated with analytical chemistry has been expanding very rapidly
during the last twenty years or so, but until recently, the efficient application of computers to this
problem has been vitiated by the high costs of computer storage and computation. With the
continuing improvement in computer technology and the steady decrease in computation costs, it
has, in the last two years, become feasible to consider the development of a completely
interactive chemical information system.
Earlier searching systems avoided the cost of bulk storage by maintaining the data files on
magnetic tape rather than disk. Tape is a very cheap form of storage but its use implies batch
searching which is necessarily slow because tape is not susceptible to random access. Magnetic
disks, on the other hand, are random access devices and the data stored on them can be searched
very rapidly. Until recently, the cost of storage on disks has precluded their use for large data
bases, but now it is becoming practical to consider this approach. Since this permits interactive
computing, we have developed a chemical information system that uses disk exclusively for the
storage of data.
Interactive computing is a significantly different process from batch, or off-line, computing and a different philosophy can be used in the design of programs for such work. A major problem that a chemist has in searching a chemical data base is that the best questions to ask are often not known. An interactive system can provide the answer to a question immediately and this will enable the user to see the deficiencies in the question and to frame a new query. In this way, there can be built a feedback loop in which the chemist acts as a transducer, "tuning" the query until the system reports precisely what is required.
The NIH-EPA Chemical Information System (CIS), which is described in this chapter, has
been designed around this general approach (1).
The CIS consists of a collection of chemical data bases together with a battery of programs
for interactive searching through these disk-stored data bases. In addition, there is a series of
programs for the analysis of data, either to reduce them to a form suitable for searching purposes
or as an end in itself.
The data bases that are in the CIS include files of mass spectra, carbon-13 nuclear magnetic
resonance spectra, x-ray diffraction data for crystals and powders, and several bibliographic data
bases. The analytical programs include a family of statistical analysis and mathematical
modelling algorithms and programs for the calculation of isotopic enrichment from mass spectral
data, analysis of nmr spectra and energy minimization of conformational structures.
a. Addition of components to the CIS. A general protocol for updating of CIS components
or the addition to the CIS of new components has been established and a schematic diagram of
this is shown in Figure 1.
In the first phase, a data base is acquired, if necessary from one of a variety of sources. Some
of the CIS data bases have been developed specifically for the CIS, an example of this being the
mass spectral data base (2). Other data bases, such as the Cambridge Crystal File (3), are leased
for use in the CIS and still others, such as the X-ray powder diffraction file (4), are operated
within the CIS by their owners. Next, the necessary program development is undertaken. If the
component is one involving searching of a data base, some reformatting of the data base, sorting
and inversion of files and so on, is usually required, and this is carried out on the NIH IBM 360-168, which is well-suited to processing of large files of data. Once searchable files have been
prepared, they are transferred to the NIH PDP-10 computer, which is primarily a time-sharing
computer, and the programs for searching through the data bases are written. The analytical, data
base-independent programs of the CIS are usually written entirely on the PDP-10. Out of this
work there finally emerges a pilot version of the component.
The pilot version is then allowed to run on the NIH PDP-10 and access to it is provided to a
small number of people who can log into the NIH computer by telephone, using long distance
calls if necessary. These users are provided with free computation and in return, they test the
component thoroughly for errors and deficiencies. Such problems are reported to the development
team, which attempts to deal with them. Depending upon the size and complexity of the
component, this testing phase can last as long as eighteen months.
When testing is complete, the entire component is exported to a networked PDP-10 in the
private sector and the version on the NIH computer is dispensed with. The component in the
private sector is available to the general scientific community and can be used on a fee-for-service basis. In this phase, the government retains no financial interest in the component; it is
"managed" by a sponsor outside the U.S. Government. The Department of Industry of the
British Government, for example, maintains the Mass Spectral Search System on the network.
Advice and consultation between such sponsors and NIH/EPA personnel continue, but the U.S.
Government does not subsidize the routine operation of CIS components in the private sector. In
fact, various agencies of the government are actually users of the CIS and they pay
according to their use of the system, like any other user. Charges for use of these components
must be designed to cover costs, and if the component attracts insufficient use at these prices,
then it is probably not viable and its sponsor need not continue to support it.
b. Computer facilities used by the CIS. Programs of the CIS have usually been designed for
use with a DEC PDP-10 computer system. The reason for this is that the PDP-10 is one of the
better time-sharing systems available and has been adopted by a number of commercial computer
network companies as the main vehicle for their networks. Transfer of a program from the NIH PDP-10 to a network PDP-10 is usually straightforward, and use of a networked computer is
favored because the alternative philosophy of exporting programs and data bases to locally
operated PDP-10 computers is less workable. This latter approach contains a number of
deficiencies that are overcome by a network. Most important in this connection is the fact that
use of a network machine means that data bases need only be stored once, at the center of the
network. A great deal of money is thus saved because duplicate storage is not necessary.
Further, a single copy of a data base is easy to maintain, whereas updating of a data base that
resides on many computers is virtually impossible. Finally, communication between systems,
personnel, and users is very simple in a network environment, as is monitoring of system
For these and other reasons, the policy of disseminating the CIS via networked PDP-10
computers was adopted at the outset and has proved to be quite successful. A typical U.S.
network of this sort has something under 100 nodes - i.e., local telephone call access is available
in about 100 locations. These are mainly in the U.S., but a substantial number will be found in
Europe. Further, some computer networks are now themselves interfaced to the Telex network,
thus making their computer system available worldwide. Irrespective of one's location, the cost
of access is somewhere between $7 and $15 per hour, depending upon the transmission speed
used and also on the time of day. Networks usually offer 110, 300 and 1200 baud service and the
response time of the system is usually negligibly small.
The only equipment that is required to establish access to a computer network is a telephone-coupled computer terminal. Typewriter terminals are becoming very common and are also
becoming relatively cheap. Such a terminal can be purchased from a variety of manufacturers for
between $1,000 and $3,000 and in general, will operate at 300 baud (30 characters/second). A
cathode ray terminal, capable of running at 1200 baud can be purchased for as little as $2,000.
Any equipment of this sort can usually be leased or purchased.
Components of the CIS
a. Mass Spectral Search System. The Mass Spectral Search System (MSSS) is the oldest
component of the CIS, having been developed in 1971, and has been seen as a prototype for
more recent components. Developed as a joint effort between NIH, EPA, and the Mass
Spectrometry Data Centre (MSDC) in England, the current MSSS data base contains about
30,000 mass spectra representing the same number of compounds. This has been derived from
an archival file containing some 60,000 spectra of the same 30,000 compounds (5). Computer
techniques have been employed to assign every spectrum a quality index (6) and where duplicate
spectra appear in the archive file, only the best spectrum is used in the working file. All
compounds in the archive have been assigned a Chemical Abstracts Service (CAS) registry
number, a unique identifier that is used to locate duplicate entries for the compounds, find the
compound in other CIS files and provide structure and synonym lookup capabilities throughout
Searches through the MSSS data base can be carried out in a number of ways. With the
mass spectrum of an unknown in hand, the search can be conducted interactively, as is shown in
Figure 2. In this search the user finds that 24 data base spectra have a base peak (minimum
intensity 100%, maximum intensity 100%) at an m/e value of 344. When this subset is examined
for spectra containing a peak at m/e 326 with intensity of less than 10%, only 2 spectra are found.
If necessary, the search can be continued in this way until a manageable number of spectra are
retrieved as fulfilling all criteria that the user cited. These answers can then be listed as is shown
in Figure 2. Alternatively, the file can be examined for all occurrences of a specific molecular
weight or a partial or complete molecular formula. Combinations of these properties can also be
used in searches. Thus, all compounds containing, for example, five chlorines and whose mass
spectra have a base peak at a particular m/e value can be identified.
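The successive narrowing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Spectrum` record, the function names and the three toy entries are hypothetical and are not taken from the MSSS itself, which may store and index its data quite differently.

```python
from dataclasses import dataclass


@dataclass
class Spectrum:
    name: str      # hypothetical label, not a real MSSS entry
    formula: dict  # element symbol -> atom count
    peaks: dict    # m/e -> relative intensity in percent (base peak = 100)


def peak_filter(spectra, mz, min_pct, max_pct):
    """Keep spectra that contain a peak at m/e `mz` with intensity in [min_pct, max_pct]."""
    return [s for s in spectra
            if mz in s.peaks and min_pct <= s.peaks[mz] <= max_pct]


def formula_filter(spectra, partial):
    """Keep spectra whose formula has exactly the given count for each listed element."""
    return [s for s in spectra
            if all(s.formula.get(el, 0) == n for el, n in partial.items())]


# Toy file of three invented entries.
toy_file = [
    Spectrum("compound A", {"C": 20, "H": 24, "O": 5}, {344: 100.0, 326: 5.0, 91: 40.0}),
    Spectrum("compound B", {"C": 22, "H": 32, "O": 3}, {344: 100.0, 326: 35.0}),
    Spectrum("compound C", {"C": 6, "H": 1, "Cl": 5}, {250: 100.0, 215: 30.0}),
]

# Step 1: base peak (minimum 100%, maximum 100%) at m/e 344.
step1 = peak_filter(toy_file, 344, 100, 100)
# Step 2: within that subset, a peak at m/e 326 no stronger than 10%.
step2 = peak_filter(step1, 326, 0, 10)
# Partial molecular formula search: any compound with five chlorines.
chlorinated = formula_filter(toy_file, {"Cl": 5})
```

Each step operates on the answer list of the previous one, which is what allows the user to stop as soon as the list is short enough to inspect.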
In contrast to these interactive searches, which are of little appeal to those with large numbers
of searches to carry out, there is available a batch-type search which accepts the complete
spectrum of the unknown and sequentially examines all spectra in the file to find the best fits. A
user's data system can be connected to the network for this purpose and the unknown spectra can
be down-loaded into the network computer for use in this search, which can be carried out at
once, or, preferably, overnight at 30% of the cost.
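A sequential best-fit search of this kind can be illustrated as follows. The text does not say how the MSSS scores a fit, so the cosine similarity used here is an assumption chosen purely for illustration, and the library entries are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra given as {m/e: intensity} dicts."""
    dot = sum(a[mz] * b[mz] for mz in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_fits(unknown, library, top=3):
    """Compare the unknown with every library spectrum and return the best matches."""
    scored = [(cosine_similarity(unknown, spec), name) for name, spec in library.items()]
    return sorted(scored, reverse=True)[:top]


# Invented reference spectra (assumption: not real data base entries).
library = {
    "ref 1": {43: 100, 58: 30, 71: 10},
    "ref 2": {77: 100, 105: 80, 182: 40},
    "ref 3": {43: 100, 58: 25, 85: 5},
}
unknown = {43: 100, 58: 28, 71: 8}
ranking = best_fits(unknown, library)
```

Because every file entry is examined in turn, the cost grows with the size of the file, which is consistent with the suggestion that such searches be run overnight.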
Once an identification has been made, the name and registry number of the data base
compound are reported to the user. If necessary, the data base spectrum can be listed or, if a
CRT terminal is being used, plotted, to facilitate direct comparison of the unknown and standard
The MSSS has been generally available through computer networks for several years and is
currently resident upon the ADP-Cyphernetics network where some 3,000 searches and
2,000 other transactions, such as retrievals, are carried out each month by the approximately 200
users. All searches in the MSSS are transaction priced at between $1 and $7 and in addition to
these charges and the connect time charge, users must pay the annual subscription fee of $300.
This fee is used to defray the annual disk storage charges which are paid in advance by the
sponsor of the MSSS, the Department of Industry of the British Government.
b. Carbon Nuclear Magnetic Resonance (CNMR) Spectral Search System. The data base
that is used in the CNMR search system consists currently of 4,400 CNMR spectra. As in the
case of the MSSS, every compound has a CAS registry number, and all exact duplicates have
been removed from the file. A specific compound may still appear in this file more than once,
however, because its CNMR spectrum may have been recorded in different solvents. The CNMR
file is still small but is growing at a fairly steady rate and should benefit considerably from recent
international agreements to the effect that all major compilations of CNMR data will, in the
future, be pooled.
Searching through this data base, as in the case of the MSSS, can be interactive or not. In the
interactive search, a user enters a shift, with an acceptable deviation, and the single frequency off
resonance decoupled multiplicity, if that is known. The program reports the number of file
spectra fitting one or both of these criteria. The names of the compounds whose spectra have
been retrieved can be listed, or alternatively, the list can be reduced by the entry of a second
chemical shift. A search for spectra of compounds having a specific complete or partial
molecular formula can also be carried out, but there is no capability for searching on molecular
weight, a parameter of little relevance to CNMR spectroscopy.
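The interactive shift search can be sketched as below. The entry names, shifts and multiplicity codes are invented for illustration, and the function name `shift_search` is hypothetical; the sketch requires both criteria to hold when a multiplicity is supplied, which is one possible reading of the description above.

```python
def shift_search(spectra, shift, tolerance, multiplicity=None):
    """Return names of spectra with a line within `tolerance` ppm of `shift`.

    If `multiplicity` is given, the same line must also have that
    off-resonance multiplicity ("s", "d", "t" or "q").
    """
    hits = []
    for name, lines in spectra.items():
        for ppm, mult in lines:
            if abs(ppm - shift) <= tolerance and multiplicity in (None, mult):
                hits.append(name)
                break
    return hits


# Invented entries: name -> list of (chemical shift in ppm, multiplicity).
spectra = {
    "entry 1": [(206.5, "s"), (30.5, "q")],
    "entry 2": [(57.5, "t"), (18.0, "q")],
    "entry 3": [(178.0, "s"), (20.5, "q")],
    "entry 4": [(29.8, "t"), (22.5, "q")],
}

first = shift_search(spectra, 30.0, 1.0)             # shift only
quartets = shift_search(spectra, 30.0, 1.0, "q")     # shift and multiplicity
# Reduce the first list by entering a second chemical shift.
second = shift_search({n: spectra[n] for n in first}, 22.0, 1.0)
```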
If an interactive search is not appropriate to the problem at hand, a batch type of search
through the data base using the techniques described by Clerc et al. (7) is available. To carry out
such a search, the user enters all the chemical shifts from the unknown and starts the search.
The entire unknown spectrum is compared to every entry in the file and the best fits are noted
and reported to the user. This program searches for the absence of peaks in a given region as
well as for the presence of peaks and thus has the capability of finding those compounds which
are structurally similar to the material that gave the unknown spectrum.
When a search is completed, the user is provided with the accession numbers of spectra that
fit the input data. The names and CAS registry numbers of the compounds in question will also
be given. If more information is required, the complete entry for a given accession number can
be retrieved. This includes a numbered structural formula, the name, molecular formula and
registry number of the compound, experimental data pertaining to the spectrum and the entire
spectrum, together with single frequency off-resonance decoupled multiplicities and (if available)
relative line intensities and assignments.
This CNMR search system recently has been made available on the ADP-Cyphernetics
network. Searches are all transaction priced at $1-3.
c. X-ray Crystallographic Search System. This is a series of search programs working
against the Cambridge Crystal File (8), a data base of some 15,000 entries dealing with published
crystallographic data mainly for organic compounds. The entry for each compound contains the
compound name, its molecular weight and registry number, the space group in which it
crystallizes and the parameters of the unit cell of the crystal. The file may be searched on the
basis of any of these parameters as shown in Figure 3, which is an example of a search for any
compounds that crystallize in space group P 1 and have a molecular weight between 250 and
300. As can be seen from Figure 3, there are 98 entries with the correct space group (scratch file
1) and 867 with a molecular weight within the specified range (scratch file 3). The intersection
of these files reveals that only 3 compounds (scratch file 4) meet both specifications, and the first
of these compounds, crystal sequence number 4413, is listed.
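The scratch-file mechanism amounts to storing the answer set of each search and intersecting the sets on request. A minimal sketch, assuming a toy file of five invented records keyed by hypothetical sequence numbers:

```python
# Invented records: sequence number -> (space group, molecular weight).
records = {
    101: ("P1", 262.3),
    102: ("P21/c", 275.0),
    103: ("P1", 310.4),
    104: ("P1", 288.1),
    105: ("C2/c", 251.7),
}

# Each search produces a "scratch file", modelled here as a set of sequence numbers.
by_group = {seq for seq, (group, _) in records.items() if group == "P1"}
by_weight = {seq for seq, (_, weight) in records.items() if 250 <= weight <= 300}

# Intersecting two scratch files yields the entries that meet both specifications.
both = by_group & by_weight
```

Keeping the intermediate sets, rather than only the final answer, lets the user combine them later with the results of other searches.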
All the compounds in this file have been registered by the CAS and their connection tables
have been merged into the file. This data base is, therefore, searchable on a structural or
substructural basis, as are all the other files of the CIS.
Once an entry of interest in this file has been located by one of the search programs, its file
accession number, the "crystal sequence number," can be used to retrieve the appropriate
literature reference or the structure, or both.
This file is available for general use via the ADP-Cyphernetics network. Currently, the
charging of options in this system is not transaction-priced. Enough statistics are now available
to indicate that all searches, other than structural searches, cost less than $2.00 and that the
structural searches cost possibly as much as $10.00.
d. X-ray Crystal Data Retrieval System. The National Bureau of Standards (NBS) has
collected a file of data pertaining to some 24,000 crystals, including those in the Cambridge file
described in (c) above (9). The data in the NBS file include the cell parameters, the number of
molecules, Z, in the unit cell, the measured and calculated densities of the crystal and two
determinative ratios, such as A/B and A/C. Every compound in the file is identified by its name,
molecular formula, and registry number, and the file can be structurally searched by the CIS
substructure search system which is described below.
Searches through this data base for crystals with specific space groups or densities are
possible and crystals with unit cells of given dimensions can also be found. It is hoped that this
may prove to be a very rapid method of identifying compounds from the readily measured crystal
e. X-ray Powder Diffraction Retrieval Program. Compounds that fail to crystallize may still
be examined by X-ray diffraction, because powders give characteristic diffraction patterns. A
collection of powder diffraction patterns proves to be a very effective means by which to identify
materials and indeed, one of the very earliest search systems in chemical analysis was based
upon such data by Hanawalt (10) nearly forty years ago. The data base of some 27,000 powder
diffraction patterns (11) that is used in the CIS is in fact a direct descendant of that with which
Hanawalt carried out his pioneering work. A problem that arises in connection with this
particular component stems from the fact that powders, as opposed to crystals, are frequently
impure. The patterns that are obtained experimentally, therefore, are often combinations of one
or more file entries. A reverse searching program, that examines the experimental data to see if
each entry from the file is contained in it (12), has been written and seems to cope with this
particular difficulty. It is currently running in test on the NIH PDP-10 and will be made
available to the scientific community during 1977.
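The idea of a reverse search can be sketched as follows: instead of asking whether the observed pattern matches a file entry, one asks of each file entry whether all of its lines are present in the observed pattern. The d-spacings, the phase names and the matching tolerance below are invented for illustration, and the sketch ignores line intensities, which the program of reference (12) may well use.

```python
def contained_in(entry_lines, observed_lines, tol=0.02):
    """True if every d-spacing of the file entry has a match in the observed pattern."""
    return all(any(abs(d - obs) <= tol for obs in observed_lines) for d in entry_lines)


def reverse_search(file_entries, observed_lines, tol=0.02):
    """Return the file entries whose patterns are wholly contained in the observed one."""
    return [name for name, lines in file_entries.items()
            if contained_in(lines, observed_lines, tol)]


# Invented file entries: name -> d-spacings in angstroms.
file_entries = {
    "phase A": [3.15, 2.22, 1.82],
    "phase B": [2.82, 1.99, 1.63],
    "phase C": [3.34, 4.26, 1.54],
}

# Observed pattern of an impure powder, here a mixture of phases A and B.
observed = [3.14, 2.82, 2.23, 1.99, 1.82, 1.63]
components = reverse_search(file_entries, observed)
```

A forward search would fail here because the observed pattern has more lines than any single entry; the reverse test is indifferent to the extra lines.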
f. Substructure Search System (SSS). All the compounds in the files of the CIS have been
assigned a registry number by the CAS. The registry number is a unique identifier for that
compound, and may be used to retrieve from the CAS Master Registry, all the synonyms that the
CAS has identified for the compound, these being names that have been used for the compound,
in addition to the name used in the CAS 9th Collective Index. Further, the registry number can
be used to locate in the CAS files, the connection table for the compound's structure. This is a
two-dimensional record of all atoms in the molecule together with the atoms to which each is
bonded and the nature of the bonds (13). This connection table is the basis of the substructure
search component of the CIS (14).
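A connection table of this kind can be modelled very simply. The class below is a hypothetical sketch and not the CAS record format: it omits hydrogens (an assumption made for brevity) and stores only the element of each node and the order of each bond.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionTable:
    atoms: dict = field(default_factory=dict)  # node number -> element symbol
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> bond order

    def add_atom(self, node, element="C"):
        self.atoms[node] = element

    def add_bond(self, i, j, order=1):
        self.bonds[frozenset((i, j))] = order

    def neighbors(self, node):
        """Return sorted (neighbor node, element, bond order) tuples for one atom."""
        found = []
        for pair, order in self.bonds.items():
            if node in pair:
                (other,) = pair - {node}
                found.append((other, self.atoms[other], order))
        return sorted(found)


# Heavy-atom skeleton of acetic acid: C1-C2(=O3)-O4.
acetic = ConnectionTable()
acetic.add_atom(1)
acetic.add_atom(2)
acetic.add_atom(3, "O")
acetic.add_atom(4, "O")
acetic.add_bond(1, 2)
acetic.add_bond(2, 3, order=2)
acetic.add_bond(2, 4)
```

Calling `acetic.neighbors(2)` lists the three atoms bonded to the carboxyl carbon together with their bond orders, which is exactly the information a substructure search needs.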
The purpose of the SSS is to permit a search for a user-defined structure or substructure
through data bases of the CIS. If a substructure is found to be in a CIS data base, then armed
with its registry number, the user can access that file and locate the compound and hence inspect
whatever data is available for it.
As the first step in this process, the user must, of course, be able to define the structure of
interest to the computer. This is done with a family of structure generation programs which can,
for example, create a ring of a given size, a chain of a given length, a fused ring system and so
on. Branches, bonds and atoms can be added and the nature of bonds can be specified. The
element represented by any node can be defined; in the absence of such definition, the atom is
presumed to be carbon. As the query structure is developed using these commands, the computer
stores the growing connection table. If the user wishes to view the current structure at any point,
the display command (D) can be invoked. This command, using the current connection table,
generates a structure diagram that can be printed at a conventional terminal.
When the appropriate query structure has been generated, a number of search options can be
invoked to find occurrences of this query structure in the data base. The two most useful search
options are the fragment probe and the ring probe. The fragment probe will search through the
assembled connection tables of the data base for all occurrences of a particular fragment, i.e., a
specific atom, together with all its neighbors and bonds. The user may specify particular
fragments in the query structure which are thought to be fairly unique and characteristic of that
structure. Alternatively, a search for every fragment in the query structure may be requested.
The general form of a fragment probe is as shown in Figure 4. The query structure contains only
one relatively unique node, C3, and this is the one which is sought in the data base. It is found to
occur 980 times and a temporary file of just those particular entries is stored as file #2. This can
be accessed by the user either for the purpose of listing its contents, as is shown in the figure, or
to intersect it with other scratch files.
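A fragment probe can be sketched by reducing every atom to its element plus the elements and bond orders of its neighbors, and then looking for that fragment in each data base structure. The data layout, the identifiers `mol-1` to `mol-3` and the requirement of an exact match on heavy-atom neighbors are all assumptions of this sketch, not details taken from the SSS.

```python
def fragments(mol):
    """Map each node to (element, sorted tuple of (neighbor element, bond order))."""
    env = {node: [] for node in mol["atoms"]}
    for i, j, order in mol["bonds"]:
        env[i].append((mol["atoms"][j], order))
        env[j].append((mol["atoms"][i], order))
    return {node: (mol["atoms"][node], tuple(sorted(env[node]))) for node in env}


def fragment_probe(database, fragment):
    """Return the identifiers of all structures containing the given fragment."""
    return {ident for ident, mol in database.items()
            if fragment in fragments(mol).values()}


# Heavy-atom skeletons with hypothetical identifiers (not real registry numbers).
acetic_acid = {"atoms": {1: "C", 2: "C", 3: "O", 4: "O"},
               "bonds": [(1, 2, 1), (2, 3, 2), (2, 4, 1)]}
acetone = {"atoms": {1: "C", 2: "C", 3: "O", 4: "C"},
           "bonds": [(1, 2, 1), (2, 3, 2), (2, 4, 1)]}
ethanol = {"atoms": {1: "C", 2: "C", 3: "O"},
           "bonds": [(1, 2, 1), (2, 3, 1)]}
database = {"mol-1": acetic_acid, "mol-2": acetone, "mol-3": ethanol}

# Probe for the most characteristic node of the query: the ketone carbon of acetone.
query = fragments(acetone)[2]
scratch_file = fragment_probe(database, query)

# A less distinctive node, a carbon bonded only to one other carbon, hits every entry.
common = fragment_probe(database, ("C", (("C", 1),)))
```

The contrast between the two probes shows why the user is advised to pick fragments that are characteristic of the query structure.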
The ring probe search is a search through the data base for all structures containing the same
ring or rings as the query structure. A ring that is considered to be an answer to such a query
must be the same size as that in the query structure. It must also contain the same number of
non-carbon atoms (hetero-atoms) but the nature of the heteroatoms and the position of any
substituents can be required by the user to be the same or different to that in the query structure.
Thus with a query structure of pyrrole, the only "exact" answer is pyrrole but the user may
permit the retrieval of "imbedded" answers which would include furan and thiophene. Similarly,
o-xylene itself is the only "exact" match for o-xylene, but m- and p-xylene would be considered
as "imbedded" matches. An example of a ring probe search is given in Figure 5. Here the query
structure is 3,4-dichlorofuran, but imbedded matches for heteroatoms and substituents have
been allowed and so the list of 304 answers will include any disubstituted pyrrole as well as any
disubstituted furan and so on. Higher substitution will also be permitted.
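The distinction between "exact" and "imbedded" ring matches can be illustrated with a simplified ring descriptor. The `Ring` record and the matching rules below are one reading of the description above, not the actual SSS algorithm; in particular, substituent positions are compared literally, with no allowance for equivalent numberings of the same ring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ring:
    name: str
    size: int
    hetero: tuple        # sorted element symbols of the non-carbon ring atoms
    substituents: tuple  # sorted ring positions that carry a substituent


def ring_probe(query, database, exact_hetero=True, exact_subst=True):
    """Return names of rings matching the query under the chosen options."""
    hits = []
    for ring in database:
        # Ring size and number of heteroatoms must always agree.
        if ring.size != query.size or len(ring.hetero) != len(query.hetero):
            continue
        if exact_hetero and ring.hetero != query.hetero:
            continue
        if exact_subst:
            if ring.substituents != query.substituents:
                continue
        elif len(ring.substituents) < len(query.substituents):
            # Imbedded matches allow other positions and higher substitution.
            continue
        hits.append(ring.name)
    return hits


database = [
    Ring("pyrrole", 5, ("N",), ()),
    Ring("furan", 5, ("O",), ()),
    Ring("thiophene", 5, ("S",), ()),
    Ring("benzene", 6, (), ()),
    Ring("o-xylene", 6, (), (1, 2)),
    Ring("m-xylene", 6, (), (1, 3)),
    Ring("p-xylene", 6, (), (1, 4)),
]

pyrrole, o_xylene = database[0], database[4]
exact_pyrrole = ring_probe(pyrrole, database)
imbedded_pyrrole = ring_probe(pyrrole, database, exact_hetero=False)
exact_xylene = ring_probe(o_xylene, database)
imbedded_xylene = ring_probe(o_xylene, database, exact_subst=False)
```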
In addition to these structural searches, there are a number of "special properties" searches
that often prove to be very useful as a means of reducing a large list of answers resulting from
structure searches. The special properties searches include searches for a specific molecular
weight or range of molecular weights and a search for compounds containing a given number of
rings. Searches may also be conducted for the molecular formula corresponding to the query
structure, or for a different user-defined molecular formula. This may be specified completely or
partially and the number of atoms of any element may be entered exactly or as a permissible
If one's purpose is to determine only the presence or absence in a data base of a specific
structure, this can be accomplished with the search option "IDENT." This program hash-encodes
the query structure connection table and searches through a file of hash-encoded connection tables
for an exact match. The search, which is very fast by substructure search standards, has been
designed specifically for those users who, to comply with the Toxic Substances Control Act (15),
have to determine the presence or absence of specific compounds in Environmental Protection
Finally, if one has completed ring probe and fragment probe searches for a specific query
structure and is still confronted with a sizeable file of compounds that satisfy the criteria that
were nominated, a substructure search through this file may be carried out. This involves an
atom-by-atom, bond-by-bond comparison of every structure.
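By contrast with an atom-by-atom comparison, the exact-match lookup performed by the IDENT option described earlier needs only one hash computation and one table lookup per query. The sketch below assumes that every connection table is already numbered canonically, so that identical structures always serialize identically; the serialization format, the use of SHA-256 and the entry names are choices made for this illustration and are not details of the CIS program.

```python
import hashlib


def table_key(atoms, bonds):
    """Hash a connection table (assumed canonically numbered) to a fixed-length key."""
    atom_part = ";".join(f"{node}{element}" for node, element in sorted(atoms.items()))
    normalized = sorted((min(i, j), max(i, j), order) for i, j, order in bonds)
    bond_part = ";".join(f"{i}-{j}:{order}" for i, j, order in normalized)
    return hashlib.sha256(f"{atom_part}|{bond_part}".encode("ascii")).hexdigest()


def build_index(file_tables):
    """Precompute the key of every file entry once, ahead of any query."""
    return {table_key(atoms, bonds): name for name, (atoms, bonds) in file_tables.items()}


def ident(index, atoms, bonds):
    """Return the name of the exactly matching entry, or None if absent."""
    return index.get(table_key(atoms, bonds))


# Two invented file entries (heavy atoms only).
toy_file = {
    "entry-1": ({1: "C", 2: "C", 3: "O"}, [(1, 2, 1), (2, 3, 1)]),
    "entry-2": ({1: "C", 2: "C", 3: "O", 4: "C"}, [(1, 2, 1), (2, 3, 2), (2, 4, 1)]),
}
index = build_index(toy_file)

found = ident(index, {1: "C", 2: "C", 3: "O"}, [(2, 1, 1), (3, 2, 1)])
missing = ident(index, {1: "C", 2: "C", 3: "N"}, [(1, 2, 1), (2, 3, 1)])
```

The lookup can only confirm presence or absence of the whole structure; it says nothing about substructures, which is why the slower probes remain necessary.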
The substructure search system is currently operating on 19 files which are given in Table 1.
The whole system is available for general use on the Tymshare computer network. A
subscription fee of $150 per year must be paid for use of the system and the only other charges
are the connect-time charge and the searching costs which range upwards from $3.00-5.00 for an
g. Mass Spectrometry Literature Search System. The accumulated files of the Mass
Spectrometry Bulletin, a serial publication of the Mass Spectrometry Data Centre, Aldermaston,
England, have been made the basis of an on-line search system.
The Bulletin, which, since 1967, has collected about 60,000 citations to papers on mass
spectrometry, may be searched interactively for all papers by given authors, all papers dealing
with one or more specific subjects or with one or more particular elements. In addition, citations
dealing with general index terms may also be retrieved. Simple Boolean logic is available, and
thus searches may be conducted for papers by Smith and Jones, or Smith but not Jones, and so
on. Citations retrieved may be limited to specific publication years, between 1967 and the
present. The interactive nature of the search provides great control to the user. One can learn
within a few minutes that while there are in the Bulletin, 463 papers dealing with mass
measurement for example, and 678 on chemical ionization, only 8 report on mass measurement
in chemical ionization mass spectra. Similarly, one can rapidly discover that although there are
532 papers dealing with carbon dioxide, only 1 of these was presented at the 1975 NATO
meeting in Biarritz.
No numerical codes are used by the system. A search for a specific subject can be carried out
by entering the subject word itself. If the word "mass" is entered, searches for 7 terms (all those
containing the fragment "mas", i.e., mass spectra, mass discrimination, mass measurement and so
on) are conducted and the user is asked to select the one of interest. In this way, knowledge of
the correct subject words or of their correct spelling is not necessary.
The whole search system has been written for use on a PDP-10 computer and is a component
of the MSDC-NIH-EPA Mass Spectral Search System. As such, it is accessible via the ADP-Cyphernetics computer network.
h. X-ray Crystal Literature Retrieval System. The data base used in the X-ray
crystallographic search system described in (c) above possesses complete literature references to
all entries in the file (8). This information has been made the basis of a system for searching the
literature pertaining to the X-ray diffraction study of organic molecules.
As in the Mass Spectrometry Bulletin Search System, it is possible to search for papers by a
specific author(s), and papers that appeared in given years in given journals may also be
retrieved. Additionally, papers may be located on the basis of specific words appearing in their
titles. These words may be truncated by the user and so the fragment "ERO" will retrieve papers
with the word "STEROID" in their titles or papers whose titles use the word
"MEROQUININE." The system generates scratch files from searches, as in the substructure
search system, and files can be intersected upon request with "AND" or "NOT" operators. Thus
one could, for example, retrieve all papers published in Acta Crystallographica since 1970 by
Atkins, excluding specifically those on corticosteroids.
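Truncated title words and the "AND" and "NOT" operators map naturally onto substring tests and set operations. The four citation records below are invented, and the author names are placeholders; nothing here reflects the actual file layout of the literature system.

```python
# Invented citation records: paper number -> author, year and title.
papers = {
    1: {"author": "Smith", "year": 1972, "title": "CRYSTAL STRUCTURE OF A CORTICOSTEROID"},
    2: {"author": "Smith", "year": 1974, "title": "MEROQUININE HYDROCHLORIDE"},
    3: {"author": "Smith", "year": 1968, "title": "A STEROID DERIVATIVE"},
    4: {"author": "Jones", "year": 1975, "title": "UNIT CELL OF AN ALKALOID"},
}


def title_search(records, fragment):
    """Return the numbers of all papers whose title contains the fragment."""
    return {num for num, rec in records.items() if fragment.upper() in rec["title"]}


# A truncated word retrieves every title in which the fragment occurs.
ero = title_search(papers, "ERO")

# Scratch files from three separate searches.
by_author = {num for num, rec in papers.items() if rec["author"] == "Smith"}
since_1970 = {num for num, rec in papers.items() if rec["year"] >= 1970}
corticosteroids = title_search(papers, "CORTICOSTEROID")

# "AND" is set intersection; "NOT" is set difference.
answer = (by_author & since_1970) - corticosteroids
```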
Once a paper of interest has been identified, all the crystallographic information in that paper
can be examined because the crystal serial number of the paper can be used in the
crystallographic search system to retrieve that information. Alternatively, the CAS registry
number of any particular compound can be used to retrieve any data of interest on that compound
from other files of the CIS.
Table 1. Files on which the substructure search system operates.
Cambridge (Xray) Crystal File.
CPSC Chemicals in Consumer Products.
EPA AEROS SOTDAT File.
EPA Las Vegas Chemical Spill File.
EPA Storage and Retrieval of Air Data.
EPA Pesticide Standards.
EPA STORET Water Data Base.
EPA-FDA Pesticide Repository Standards.
EPA Inactive Ingredient in Pesticides.
EPA Oil and Hazardous Materials File.
EPA Pollutants in Drinking Water.
EPA Pesticides File.
EROICA Thermodynamics Data File.
NBS Gas Phase Proton Affinities.
NBS Heats of Formation of Gaseous Ions.
NBS Single Crystal File.
NCI-SRI Industrial Chemicals File.
NCI-PHS-149 File of Carcinogens.
NIMH File of Psychotropic Drugs.
NIH-EPA Carbon-13 Nuclear Magnetic Resonance Search System.
NIH-EPA Mass Spectral Search System.
WHO International Non-proprietary Name File of Drugs.
The X-ray literature search system is operating on the ADP-Cyphernetics network as a
component of the X-ray crystallographic search system. Searches are not transaction priced but
cost under $2.00 on average.
i. Proton Affinity Retrieval Program. With the current high level of interest in chemical
ionization mass spectrometry, there is a need for a reliable file of gas phase proton affinities. No
data base of this sort has previously been assembled and for these reasons, the task of gathering
and evaluating all published gas phase proton affinities has been undertaken by Rosenstock and
coworkers at NBS. This file (16), which has about 400 critically evaluated gas phase proton
affinities drawn from the open literature, can be searched on the basis of compound type or of the
proton affinity value. It will be appended to the MSSS and the bibliographic component will be
merged with the Mass Spectrometry Bulletin Search System.
j. NMR Spectrum Analysis Program. Many proton nmr spectra can be satisfactorily
analyzed by hand, and such first order analysis is, in these cases, a quite satisfactory way of
assigning chemical shifts and coupling constants to the various nuclei involved. In certain cases,
however, so-called second order effects become important and as a result, more or fewer spectral
lines than are indicated by first order consideration will result. A way to analyze such spectra is
to estimate the various coupling constants and chemical shifts and then, using any of a variety of
standard computer programs (17), calculate the theoretical spectrum corresponding to these
values. The calculated spectrum can be compared to the observed spectrum and a new estimate
of the data can be made. In this way, by a series of successive approximations, the correct
coupling constants and chemical shifts can be determined.
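The simplest case in which such a calculation is needed is the two-spin AB system, for which the theoretical spectrum has a closed form. With J the coupling constant and C = sqrt((νA - νB)² + J²), the four lines lie at the mean of νA and νB plus or minus (C + J)/2 (outer lines) and plus or minus (C - J)/2 (inner lines), with relative intensities 1 - J/C and 1 + J/C respectively. The sketch below evaluates this textbook result; it is not the method of the programs cited in (17), which handle larger spin systems, and the function name and the example values are chosen for illustration only.

```python
import math


def ab_spectrum(nu_a, nu_b, j):
    """Return sorted (frequency in Hz, relative intensity) pairs for an AB system.

    Assumes j > 0 and that nu_a and nu_b are not both equal with j == 0.
    """
    center = (nu_a + nu_b) / 2.0
    c = math.hypot(nu_a - nu_b, j)  # sqrt(delta**2 + j**2)
    outer = 1.0 - j / c
    inner = 1.0 + j / c
    return sorted([
        (center - (c + j) / 2.0, outer),
        (center - (c - j) / 2.0, inner),
        (center + (c - j) / 2.0, inner),
        (center + (c + j) / 2.0, outer),
    ])


# Trial values: shifts of 106 and 100 Hz with an 8 Hz coupling.
lines = ab_spectrum(106.0, 100.0, 8.0)
```

With these trial values a first order analysis would place two symmetrical doublets at 96, 104 Hz and 102, 110 Hz, whereas the calculation gives lines near 94, 102, 104 and 112 Hz with weak outer and strong inner lines. Comparing such a calculated spectrum with the observed one, adjusting the trial values and repeating is the successive approximation described above.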
The CIS component GINA (Graphical Interactive NMR Analysis) which is based upon the
programs developed by Johannesen et al. (18), permits these operations in real time in an
interactive fashion. The program is designed for use with a vector cathode ray tube terminal
upon which each new theoretical spectrum can be displayed for comparison by the user with the
observed spectrum. The program has been available at NIH for over four years (19) and is
currently being exported to a computer network in the private sector. The cost of using the
program is not yet well established because it is subject to wide variations.
k. Mathematical Modelling System (MLAB). MLAB is a program set, developed by Knott at
NIH (20), which can assimilate a file of experimental data, such as a titration curve, and perform
on it any of a wide variety of mathematical operations. Included amongst these are differential
and integral calculus, statistical analysis (mean and standard deviation, curve and distribution
fitting and linear and non-linear regression analysis). Output data can be presented in any form,
but the PDP-10-resident program is especially powerful in the area of graphical output. Data can be
displayed in the form of two- or three-dimensional plots which can be viewed and modified on a
CRT terminal prior to pen-and-ink plotting.
This program set is now available for general use on the ADP-Cyphernetics network. The
cost of using MLAB is based upon the computer resource units and depends, therefore, upon the
type of work that is being done.
l. Isotopic Label Incorporation Determination (LABDET). Radioisotopes are particularly
well-suited to labelling studies because they can be very easily detected at very low levels. In
recent years, however, there has been increasing concern about the shortcomings of radioisotopes
in medical research. Current standards, in fact, take the position that the use of radioisotopes
such as carbon-14 in children and women of child-bearing age is precluded. Consequently, it is
not possible to study the metabolism of drugs in such patients using radioisotopes, and this leads
to some difficulty because it is only in such patient groups that the metabolism of drugs, such as
oral contraceptives, is of relevance.
Much effort has gone into studies of the possibilities of carbon-13 as a surrogate for carbon-14, and this type of work applies also to problems involving oxygen and nitrogen which have
stable isotopes, but no convenient radioisotopes. Mass spectrometry is the best general method of
detection and quantitation of stable isotopes in molecules, but a serious problem involved
in its application is that stable isotopes such as carbon-13, nitrogen-15 and oxygen-18 occur
naturally as minor components of the natural elements. This is most pronounced in the case of
carbon. Naturally-occurring carbon is about 99% C-12 but a small variable amount of all natural
carbon is C-13. This creates a "background" against which determinations of isotope levels in
labelled compounds must be measured. The purpose of LABDET (21) is to compare the mass
spectrum of an unlabelled compound with that of the same compound isotopically enriched.
This is usually done using the molecular ion region of the spectra. The program calculates an
estimate of the level of incorporation of isotope and then calculates a theoretical spectrum, which
is compared with the observed spectrum. The estimate is then
adjusted and a further comparison is made, and in this way, the program proceeds through a
predetermined number of iterations, finally calculating the correlation coefficient between the
observed spectrum and the best theoretical spectrum.
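The iterative comparison described above can be sketched as follows. This is a hedged illustration and not the LABDET code (21): it assumes a single labelled position that shifts the ion by one mass unit, ignores the natural abundance already present at that position, and uses invented intensities. All function names are hypothetical, and the observed pattern is assumed to contain one more peak than the unlabelled pattern for each unit of mass shift.

```python
import math

def theoretical(unlabelled, x, shift=1):
    """Molecular ion pattern for fractional incorporation x of a label
    that moves the ion 'shift' mass units higher."""
    out = [0.0] * (len(unlabelled) + shift)
    for k, u in enumerate(unlabelled):
        out[k] += (1.0 - x) * u       # molecules without the label
        out[k + shift] += x * u       # molecules carrying the label
    return out

def normalise(pattern):
    total = sum(pattern)
    return [v / total for v in pattern]

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length patterns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

def estimate_incorporation(unlabelled, observed, shift=1, iterations=12):
    """Refine the incorporation estimate over a fixed number of iterations."""
    obs = normalise(observed)

    def misfit(x):
        calc = normalise(theoretical(unlabelled, x, shift))
        return sum((c - o) ** 2 for c, o in zip(calc, obs))

    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        step = (hi - lo) / 10.0
        grid = [lo + i * step for i in range(11)]
        best = min(grid, key=misfit)              # best estimate on this pass
        lo, hi = max(0.0, best - step), min(1.0, best + step)
    x = 0.5 * (lo + hi)
    r = correlation(normalise(theoretical(unlabelled, x, shift)), obs)
    return x, r

# Invented example: M, M+1, M+2 intensities of an unlabelled compound,
# and a synthetic "observed" pattern built with 35% incorporation.
unlabelled_pattern = [100.0, 11.0, 0.5]
observed_pattern = theoretical(unlabelled_pattern, 0.35)
x_est, r_est = estimate_incorporation(unlabelled_pattern, observed_pattern)
```

The natural-abundance "background" enters through the unlabelled pattern itself, so the estimate is made relative to it rather than to an idealized monoisotopic ion.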
This calculation is not difficult so much as tedious and if one must carry it out many times
per day, use of the computer is indicated. LABDET is an option within MSSS on the ADP-Cyphernetics network and its use costs $2.00.
m. Conformational Analysis of Molecules in Solution. A problem of long standing in
chemistry has been to estimate the relationship between the conformation of a molecule in the
crystal, as measured by X-ray methods, and that in solution, where barriers to rotation are greatly
reduced. A sophisticated program set for Conformation Analysis of Molecules in Solution by
Empirical and Quantum-mechanical methods (CAMSEQ) has been developed for this purpose by
Hopfinger and coworkers (22) at Case Western Reserve University.
This program can run in batch or interactively. As input data, it requires the structure of the compound, which can be provided in any of three ways: as a set of coordinate data from X-ray measurements, interactively in the form of a connection table, or simply as a CAS registry number, in which case the program will use the corresponding connection table if it is in the files of the CIS.
The first task is to generate the coordinate data corresponding to a particular compound.
Then the free energy of this conformation in solution is calculated. Next, the program begins to
change torsion angles specified by the user in the conformation and, with each new conformation,
a statistical thermodynamic probability is calculated, based upon potential (steric, electrostatic,
and torsional) functions and terms for the free energy associated with hydrogen-bonding,
molecule-solvent, and molecule-dipole interactions. This program, in its interactive version, can
be run in under 40K words of core, and work is in progress to export it to a commercial
One of the first goals of the CIS was to produce a series of searchable chemical data bases for
use by working analytical chemists with no especial computer expertise. A second aim was to
link these data bases together so that the user need not be restricted to a consideration of, for
example, only mass spectral data.
The various problems inherent in these plans included acquisition of data bases, design of
programs, dissemination of the resulting system and linking, via CAS registration numbers, of
the various CIS components. These problems, as has been described above, have been solved
conceptually and, to a large extent, practically, and the CIS, as it now stands, is the result. It is
now possible, therefore, to review the system in an effort to define future goals, and a number of
these seem fairly clear.
Searches through more than one data base in combination would be desirable. For example,
one often possesses mass spectral and nmr data for an unknown, and it would be very useful to be
able to identify any compounds that match these data in a single search. In another development,
it is expected that the CONGEN program developed for the DENDRAL project (23) will be
merged into CIS during the coming year. This program, which generates structures
corresponding to a specific empirical formula, could be extremely useful in a strategy for
structure solving using the CIS. It is not at all difficult to envisage situations in which a reduced set
of structures could be produced by CONGEN for consideration. Each structure in turn could be
used as an input in the substructure search system, and the various compounds whose registry
numbers are so retrieved could be considered to be possible answers to the problem.
Confirmation for any of them could then be sought in the spectral data bases, the registry number
being all that is necessary to locate and retrieve data. One can even speculate further to the day
when synthetic pathways to any likely candidates could be designed by the computer system
which could easily add the very practical touch of checking that any starting materials for such
syntheses are commercially available at an appropriately low cost!
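The combined search proposed at the start of the preceding paragraph amounts to intersecting the sets of registry numbers returned by the individual data bases. The sketch below illustrates only that idea: the two dictionaries are toy stand-ins and not the CIS files, the record fields and function names are hypothetical, and real searches would match full spectra rather than a nominal mass and a count of carbon-13 signals.

```python
# Toy stand-ins for two data bases, both keyed by CAS registry number.
MASS_SPECTRAL = {
    "71-43-2": {"name": "benzene", "nominal_mass": 78},
    "108-88-3": {"name": "toluene", "nominal_mass": 92},
    "108-95-2": {"name": "phenol", "nominal_mass": 94},
}
CARBON_NMR = {
    "71-43-2": {"signals": 1},
    "108-88-3": {"signals": 5},
    "108-95-2": {"signals": 4},
}

def search_mass(nominal_mass):
    """Registry numbers whose mass spectral record has this nominal mass."""
    return {rn for rn, rec in MASS_SPECTRAL.items()
            if rec["nominal_mass"] == nominal_mass}

def search_nmr(signals):
    """Registry numbers whose carbon-13 record has this many signals."""
    return {rn for rn, rec in CARBON_NMR.items()
            if rec["signals"] == signals}

def combined_search(nominal_mass, signals):
    """Compounds that satisfy both queries; the registry number is the link."""
    return search_mass(nominal_mass) & search_nmr(signals)

candidates = combined_search(92, 5)
names = sorted(MASS_SPECTRAL[rn]["name"] for rn in candidates)
```

Because every file shares the registry number as its key, the same intersection could be applied to registry numbers retrieved by a substructure search before confirmation is sought in the spectral files.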
In a different approach, the power of pattern recognition techniques could be assessed within
some of the very large files contained in the CIS. This is a very useful exercise because there is
little reported work of this sort on large files and thus we have begun to explore the value of such
methods in handling the problem of identification of true unknowns such as water pollutants.
Programs designed to test mass spectra for the presence in the compound of oxygen or nitrogen
are currently being tested (24) and their utility as prefilters on mass spectral data prior to data
base searching will be tested as soon as feasible.
In summary, it is felt that progress to date with the CIS has demonstrated economic
feasibility in that a number of relatively stable CIS components have now been in the private
sector for some time. The test before us is whether we can capitalize on this to explore the new
and exciting possibilities that lie ahead in the area of structure determination by computer.
1. Heller, S. R., Milne, G. W. A., and Feldmann, R. J. Science, (1977) 195, 253.
2. Heller, S. R., Fales, H. M., and Milne, G. W. A. Org. Mass Spectrom., (1973) 7, 107; Heller, S. R., Koniver, D. A., Fales, H. M., and Milne, G. W. A. Anal. Chem. (1974) 46, 947; Heller, S. R., Feldmann, R. J., Fales, H. M., and Milne, G. W. A. J. Chem. Soc., (1973) 13, 130; Heller, R. S., Milne, G. W. A., and Heller, S. R. J. Chem. Inf. Comp. Sci. (1976) 16, 176.
3. This data base (ref. 8) is leased by NIH on behalf of the entire U. S. scientific community.
4. McCarthy, G. and Johnson, G. G., paper C3 presented as a part of the Proceedings of the American Crystallographic Association meeting, State College, PA., 1974.
5. Heller, S. R., Milne, G. W. A., and Feldmann, R. J. J. Chem. Inf. Comp. Sci. (1976) 16, 132.
6. This is carried out using an unpublished program developed by McLafferty and co-workers at Cornell University.
7. Schwarzenbach, R., Milne, J., Koenitzer, H., and Clerc, J. T. Org. Magn. Res. (1976) 8, 11.
8. Kennard, O., Watson, D. G., and Town, W. G. J. Chem. Doc. (1972) 12, 14.
9. These data are available as NBS tape #9 through the National Technical Information Service, Springfield, VA 22151.
10. Hanawalt, J. D., Rinn, H. W., and Frevel, L. K. Ind. Eng. Chem. (Anal.) (1938) 10, 457.
11. This file is a proprietary product of the Joint Committee on Powder Diffraction Standards, 1601 Park Lane, Swarthmore, PA 19801.
12. Abramson, F. P. Anal. Chem. (1975) 47, 45.
13. Chemical Abstracts Service Standard Distribution Format File, 1976. Chemical Abstracts Service, Columbus, OH 43210.
14. Feldmann, R. J., Milne, G. W. A., Heller, S. R., Fein, A., Miller, J. A., and Koch, B. A. J. Chem. Inf. Comp. Sci., (1977) in press.
15. Toxic Substances Control Act, Public Law 94-469, enacted October, 1976.
16. Hartmann, K., Lias, S., Ausloos, P. J., and Rosenstock, H. M. Publication NBSIR 76-1061, July, 1976.
17. Castellano, S. and Bothner-By, A. A. J. Chem. Phys. (1964) 41, 3863; Swalen, J. D. and Reilly, C. A. J. Chem. Phys. (1965) 42, 440.
18. Johannesen, R. B., Ferretti, J. A., and Harris, R. K. J. Magn. Res. (1970) 3, 84.
19. Heller, S. R. and Jacobson, A. E. Anal. Chem. (1972) 44, 2219.
20. Knott, G. D., and Shrager, R. I. Assn. Comp. Machin. SIGGRAPH Notes 6 (1972) 138.
21. Hammer, C. F., Department of Chemistry, Georgetown University, Washington, DC, unpublished work.
22. Weintraub, H. J. R., and Hopfinger, A. J. Int'l. J. Quant. Chem. (1975) 9, 203.
23. Carhart, R. E., Smith, D. H., Brown, H., and Djerassi, C. J. Amer. Chem. Soc. (1975) 97, 5755.
24. Meisel, W., Jolley, M., and Heller, S. R., in preparation.