A Carbon-13 Nuclear Magnetic Resonance
Spectral Data Base and Search System
D. L. Dalrymple
Nicolet Technology Corporation, Mountain View, California 94041, USA
C. L. Wilkins
Department of Chemistry, University of Nebraska, Lincoln, Nebraska 68588, USA
G. W. A. Milne*
National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland 20014,
USA
S. R. Heller*
Environmental Protection Agency, PM-218, Washington D.C., 20460, USA
* Authors to whom correspondence should be addressed
A data base containing approximately 4000 13C nuclear magnetic resonance spectra
has been assembled. The spectra have been evaluated and all the corresponding
compounds have been registered by the Chemical Abstracts Service (CAS). The data
base is available to the international scientific community on magnetic tape or
microfiche and is also the basis of a search system operating upon an international
computer network.
BACKGROUND
As the considerable utility of 13C NMR (CNMR) spectroscopy in biochemical, biological and
environmental research has emerged, it has become clear that a large, readily available file of
high quality reference CNMR spectra would be of great value as an adjunct to such work. No
data base of this sort was available and it was a fairly straightforward procedure to use the
experience gathered from our work in connection with the data base of the EPA-NIH Mass
Spectral Search System (1) to define the criteria necessary for the generation of a CNMR data
base.
Nomenclature must be handled according to an accepted standard, and for this reason, it was
decided to secure Chemical Abstracts Service (CAS) registry numbers and names for all
compounds. Spectra that were to be used in the data base must be evaluated one by one by
professional spectroscopists. The evaluation consists of the following steps. The name,
molecular weight, molecular formula and structure of the compound are checked to ensure that
they agree with one another. The spectrum is then reviewed to ascertain that the number of
lines is either consistent with the structure, or that any inconsistencies can be rationalized. All
assignments provided are inspected, and those that seem clearly in error are either confirmed
at the original source or deleted. Off-resonance decoupling data are checked
in the same way. Finally, a number of decisions must be made, as described below, as to which
spectra to retain and which to discard.
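The first two evaluation checks lend themselves to a simple automated pre-screen. The following is a minimal sketch only, not the procedure used at Nebraska: the function names, the tolerance and the short table of atomic masses are assumptions made for illustration, and a flagged entry would still be passed to a spectroscopist, since inconsistencies may be rationalized.

```python
import re

# Approximate standard atomic masses for a few common elements (illustrative subset).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999,
               "F": 18.998, "S": 32.06, "Cl": 35.45, "Br": 79.904, "I": 126.904}

def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C8H8O'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def molecular_weight(formula):
    """Sum the atomic masses of all atoms in the formula."""
    return sum(ATOMIC_MASS[el] * n for el, n in parse_formula(formula).items())

def check_entry(formula, stated_weight, shifts, tolerance=0.05):
    """Return a list of points for the evaluator to review (empty if none)."""
    problems = []
    if abs(molecular_weight(formula) - stated_weight) > tolerance:
        problems.append("molecular weight does not match formula")
    if len(shifts) > parse_formula(formula).get("C", 0):
        problems.append("more lines than carbon atoms")
    return problems

# Acetophenone, C8H8O, with three of its shifts: nothing is flagged.
print(check_entry("C8H8O", 120.15, [197.5, 137.0, 26.0]))  # []
```

A spectrum with fewer lines than carbon atoms is not flagged here, because molecular symmetry commonly produces coincident signals.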
The cost of such a project, while considerable, is justified in terms of the promise it gives of
more rapid and accurate identification of unknown organic compounds by members of the
research community. In addition, it is expected that, like the Mass Spectral Search System
(MSSS), this CNMR data base and search system will, as development decreases and use
increases, become largely self-supporting. Thus, in the spring of 1975, EPA and NIH initiated
a joint CNMR spectral data project. A contract calling for the coordination and evaluation of
the data base was awarded to the University of Nebraska and programs for interactive
searching through the data base were written by one of us (DLD).
At about this time, a joint consortium was organized to do much the same work in Europe and
as a result of a joining of forces, the project quickly grew into an international collaborative
effort. Currently, data for the CNMR file are being collected by scientists in the United States,
the Netherlands, Germany, Switzerland, France, Hungary and Japan. Managerial responsibility
for the entire data base and search system has been assumed by the Netherlands Information
Combine (NIC), a part of the Royal Dutch Chemical Society. Copies of the data base are
available from the NIC and the complete search system is accessible by local telephone call
via an international computer network. At present, there are approximately 4000 spectra in the
data base, representing some 3900 compounds. Several compounds are represented more than
once in the data base since their spectra may be run under different experimental conditions
in which parameters such as temperature or solvent may be changed.
The major source of all the spectra is the open literature. Several collections of spectra have
been incorporated into the data base. For example, about 1400 spectra were obtained from
a file built by Dalrymple at the University of Delaware, 400 were derived from the files of
Clerc at ETH, Zurich and 900 were obtained from data measured by Roberts's group at the
California Institute of Technology and collected by Dorman of Eli Lilly Inc. Perhaps 10%
of all file entries are of unpublished spectra. These are submitted directly to the University
of Nebraska by the person who measures them.
A backlog of about 9000 spectra is in hand and should be added to the file during the next
year. The extent of overlap between the backlog and the current file is not yet known.
DATA BASE
The spectra used in the formation of the CNMR data base have been obtained from a
variety of sources, such as those described in the preceding section. A total of thirteen
laboratories in seven countries (2) are now involved in the collection of 'raw' data from
their own files and also from the open literature. These spectra are then pooled, in Europe
by W. Bremser of BASF in Ludwigshafen, Germany, and in the US by C. L. Wilkins. Data
are exchanged between BASF and the University of Nebraska using an exchange format
designed by Bremser (3). This format, which is used only for the exchange of data on
magnetic tape, contains the data elements that are used in the search system or the data
base, and is available upon request from SRH or GWAM.
Once a spectrum is obtained by the University of Nebraska, the compound name and
molecular formula are submitted to CAS. There the compound is identified and its CAS registry
number and CAS collective index name are returned to Nebraska where they are merged into
the growing file. The inclusion of the registry number is crucial because it permits linking
between entries in the CNMR data base and corresponding entries in other files of the NIH-EPA Chemical Information System (CIS) (4). In particular, it allows one to use the
Substructure Search components of the CIS to search the CNMR data base for specific structures or substructures.
When a new entry has been passed through CAS, the existing data base is checked for its
registry number and if it is absent the spectrum is evaluated and added to the data base. If the
registry number is already in the file, then a check for points of difference between the old and
new spectra is made. If there are significant differences, e.g. different solvents, more complete
assignments and so on, the new spectrum is added to the data base, but if the new entry is
substantially identical to the existing one it is not used. As will be seen below, this has a
bearing upon the cost of leasing the data base.
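The acceptance rule just described can be sketched as follows. This is an illustration under stated assumptions rather than the code used to maintain the file: the record fields (`regn`, `solvent`, `assigned`) are hypothetical, and "significant difference" is reduced here to the two examples named in the text, a different solvent or more complete assignments.

```python
def should_add(new_entry, database):
    """Decide whether a newly registered spectrum joins the data base.

    `database` maps a CAS registry number to the list of entries already on file.
    """
    existing = database.get(new_entry["regn"], [])
    if not existing:
        return True  # compound not yet represented: evaluate and add
    for old in existing:
        same_solvent = old["solvent"] == new_entry["solvent"]
        no_more_assignments = new_entry["assigned"] <= old["assigned"]
        if same_solvent and no_more_assignments:
            return False  # substantially identical to an entry already on file
    return True  # differs significantly from every existing entry

on_file = {"98862": [{"regn": "98862", "solvent": "CDCl3", "assigned": 6}]}
print(should_add({"regn": "98862", "solvent": "CDCl3", "assigned": 6}, on_file))  # False
print(should_add({"regn": "98862", "solvent": "DMSO", "assigned": 6}, on_file))   # True
```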
To date, 15 000 spectra have been obtained and are being examined in this way. About half
of these were collected in the US and Japan and the remainder were collected in Europe. It has
been agreed by all collaborating groups that the master merged file, which contains spectra
from all contributors, should be made available to the public in as many forms as are scientifically and practically feasible. The methods of dissemination which are now being used
include the following. First, a magnetic tape of the full data base is available from the NIC on
an annual lease basis (6). For an individual organization the cost of an annual lease is US$250.
A desirable goal for all concerned is to enlarge the data base and it has therefore been decided
that up to 50% of the annual lease fee can be 'paid' with new spectra. A credit of $5 is given
for each new spectrum judged to be acceptable for inclusion into the data base. Second, since
there is still a considerable value associated with data in the form of 'hard copy,' microfiche
of the data base are being produced by our German collaborators. These are expected to be
available at a nominal cost. Microfiche are inexpensive and easy to generate by computer, and
can be discarded as updates of the file become available. Third, printed compilations of CNMR
spectra may be published and offered for sale. Finally, an interactive search system based upon
the CNMR data is available in Europe and North America on a fee-for-service basis via an
international computer network. This system, which is described in detail below, has been
available for over a year and is now being used by some twenty laboratories. At present an annual subscription fee of $100 is charged for the use of this system.
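As a worked illustration of the lease arithmetic quoted above, the credit for contributed spectra can be computed as below. The figures are those given in the text; the function name is an assumption.

```python
LEASE_FEE = 250.00           # annual lease for an individual organization, US$
CREDIT_PER_SPECTRUM = 5.00   # credit for each accepted new spectrum, US$
MAX_CREDIT_FRACTION = 0.50   # at most half the fee may be 'paid' with spectra

def lease_payment(accepted_spectra):
    """Cash owed for one year's lease after crediting accepted new spectra."""
    credit = min(accepted_spectra * CREDIT_PER_SPECTRUM,
                 MAX_CREDIT_FRACTION * LEASE_FEE)
    return LEASE_FEE - credit

print(lease_payment(10))  # 200.0
print(lease_payment(40))  # 125.0
```

The credit therefore saturates at 25 accepted spectra, beyond which the cash payment remains $125.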
SEARCH SYSTEM
The CNMR search system is a part of the NIH-EPA Chemical Information System (4) and is
very similar to the Mass Spectral Search System (1) which is another component of the larger
system. Searching through the CNMR data base can be accomplished using the options shown
in Table 1. Each of the options can be used at a fixed price. These transaction prices, which
are essentially cpu charges, are also given in Table 1.
Much of the software used in the CNMR search system was originally designed for use in the
MSSS. This has been beneficial in that such programs have been extensively tested and
debugged and also that users in many cases are familiar with the style of the dialog. Usage to
date of the CNMR search system is comparable to the early usage of MSSS, even though the
size of the CNMR data base is only about 20% that of the corresponding MSSS data base.
Table 1. Options of the CNMR Search System
Option | Purpose | Cost ($) |
SHIFT | To search by chemical shift | 1.00 |
MF | To search by full molecular formula | 1.00 |
PF | To search by partial molecular formula | 1.00 |
SPEC | To list and identify a spectrum | 1.00 |
REGN | To identify a spectrum | 0.25 |
CLERC | To identify a complete spectrum | 2.00 |
HELP | To obtain an explanation of an option | 0.25 |
NEWS | To list the current newsletter | 0.25 |
PRICE | To obtain a schedule of prices | 0.25 |
COM | To enter a comment or complaint | 1.00 |
EXIT | To exit from the program | --- |
The most useful of the search options is the SHIFT search. This program accepts a
chemical shift, expressed in ppm from TMS, whose signal is arbitrarily defined as occurring
at 0 ppm. The program can also accept, but does not require, a permissible deviation from
this value and the multiplicity of the single frequency off-resonance decoupled signal, the
SFORD. If no deviation is entered, the program assigns a window of width 1.0 ppm about
the entered frequency. The SFORD multiplicity (S=singlet, D= doublet, T=triplet,
Q=quartet) is a measure of the number of protons attached to a carbon and may or may not
be known to the investigator. If a multiplicity value is entered, it will be used in the search,
but if this information is not available the search will be conducted without it.
The program now searches through the inverted files of the data base for spectra containing
the data specified by the user, and reports back that a certain number of spectra match
the criteria as specified. The user is then given a choice of listing these entries, ending the
search or entering another chemical shift. If the list command is issued, the compound
name and file ID number of each hit is listed. After every ten entries are listed, the user is
asked if the listing should be continued. If a second shift is entered, the search is repeated
using this second value and the new list of drops is combined in a Boolean AND operation
with the existing list. The user is informed how many spectra contain both shifts and is
again given the choice between listing these, ending the search, or entering another shift.
An example of the SHIFT search is given in Fig. 1. Here the user enters a chemical shift of
197.5 ppm with a deviation of 0.5 ppm and an SFORD multiplicity value of S. This will be
matched by any spectrum in the data base containing a signal between 197.0 and 198.0
ppm, which, when subjected to off-resonance decoupling, appears as a singlet. There are 37
such spectra, too many to inspect conveniently, and so a second shift, 26.0 ppm, is entered.
The number of spectra matching both these shifts is reduced to 3, and a third shift, 137.0
ppm, reduces this list to a single hit, ID# 300, Ethanone, 1-phenyl- (acetophenone), CAS
registry number 98862.
Figure 1. The SHIFT search option of the CNMR search system.
The SHIFT search is thus automatically convergent and the rate of convergence depends
upon the characteristic nature of the entered shifts. Any entered shift which reduces the
number of hits to zero is rejected and an appropriate message is returned to the user, who may
then re-enter the shift with a different deviation and/or SFORD value. Alternatively, a
different shift may be entered, or the search may be terminated.
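The logic of the SHIFT search, a window about each entered shift, an optional SFORD filter, Boolean AND between successive shifts and rejection of any shift that would empty the list, can be sketched as below. This is not the FORTRAN program itself: the inverted file is represented by a plain list of tuples, the function names are assumptions, and apart from ID 300 and the three entered shifts of Fig. 1 the data are invented for illustration.

```python
def shift_search(inverted, shift, deviation=0.5, sford=None):
    """Return the IDs of spectra with a line within `deviation` ppm of `shift`.

    `inverted` is a list of (ppm, multiplicity, spectrum_id) tuples. The default
    deviation of 0.5 ppm gives the 1.0 ppm window described in the text.
    """
    return {
        spec_id
        for ppm, mult, spec_id in inverted
        if abs(ppm - shift) <= deviation and (sford is None or mult == sford)
    }

def refine(hits, new_hits):
    """AND a new shift into the running hit list; reject it if nothing survives."""
    combined = new_hits if hits is None else hits & new_hits
    if not combined:
        return hits, False  # previous list is kept and the shift is rejected
    return combined, True

# Toy inverted file; only ID 300 (acetophenone) comes from the paper.
lines = [(197.6, "S", 300), (137.1, "S", 300), (26.2, "Q", 300),
         (197.2, "S", 301), (26.4, "Q", 301), (197.9, "D", 302)]

hits, ok = refine(None, shift_search(lines, 197.5, 0.5, "S"))
hits, ok = refine(hits, shift_search(lines, 26.0))
hits, ok = refine(hits, shift_search(lines, 137.0))
print(hits)  # {300}
```

Because each accepted shift can only shrink the hit list, the search converges in the manner described above.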
One of the results of the SHIFT search, or of other searches such as the molecular formula
search, is the file ID number of any spectrum which matches the input data. This number
can be used in the SPEC option, as shown in Fig. 2, to retrieve all the information
pertaining to the file entry in question. Thus a logical sequel to the SHIFT search of Fig. 1
would be the retrieval shown in Fig. 2. Here, the ID number of 300 is entered and the
program prints a numbered structural diagram of the compound in question, the name,
registry number, molecular weight and molecular formula, the source reference and the
solvent in which the spectrum was measured. This is followed by a listing of the chemical
shifts and, when available, their SFORD multiplicities, intensities and assignments according to the numbering system used in the structural diagram.
A generally useful method of searching through chemical data is by means of complete or
partial molecular formulae. The MF option of the CNMR search system prompts the user
for the molecular formula in question and retrieves the spectra of all compounds with that
formula. The molecular formula is entered in a standard fashion; atom-subscript pairs must
be entered as carbon first, hydrogen second and then in alphabetical order. When the search
is complete, the number of hits is reported to the user who can terminate the search or list
the ID numbers which have been found, together with the corresponding names and
registry numbers. If the list is lengthy, it is halted after every ten file entries and the user is
asked whether or not it should be continued. If more information about any particular entry
is desired, it must be sought using the SPEC option.
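The required ordering of a molecular formula (carbon first, hydrogen second, remaining elements alphabetically) can be produced mechanically. The sketch below assumes that a subscript of 1 is omitted, which the text does not state; the function name is hypothetical.

```python
import re

def canonical_formula(formula):
    """Reorder a formula as C, then H, then the other elements alphabetically."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    order = [el for el in ("C", "H") if el in counts]
    order += sorted(el for el in counts if el not in ("C", "H"))
    # Assumption: a count of 1 is written without a subscript.
    return "".join(f"{el}{counts[el] if counts[el] > 1 else ''}" for el in order)

print(canonical_formula("OC8H8"))   # C8H8O
print(canonical_formula("IFH4C6"))  # C6H4FI
```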
Table 2. Search types of the partial molecular formula (PF) option
Search Type | Example | Retrieved |
(Element) | F | All fluorine-containing compounds |
(Element)(Number) | N3 | All compounds with 3 nitrogens |
(Element)(Range) | C3-7 | All compounds with 3-7 carbon atoms |
(*)(Element) | *S | All compounds not containing sulphur |
(*)(Element)(Number) | *Br4 | All compounds with 0-3 or 5 or more bromines |
(*)(Element)(Range) | *Br1-3 | All compounds with zero or more than three bromines |
The partial molecular formula search, PF, can be used to find the spectra of compounds
with defined partial formulae. As can be seen from Table 2, a search may be conducted for
all compounds containing or not containing a specific element, or for those containing or
not containing specific numbers or ranges of numbers of specific elements. After each PF
search is completed, the number of hits is reported and the user is given the option of
listing the responding entries, terminating the search, entering further partial formula
details, or beginning a search based upon chemical shifts. In this last case, the SHIFT
search proceeds as described above, but any list of spectra having specified chemical shifts
is intersected with the list of spectra having the previously specified partial molecular
formulae before being presented to the user. An example of the PF/SHIFT option is given
in Fig. 3 where the spectra are sought of compounds containing any number of fluorine
atoms, and between 10 and 20 carbon atoms. These spectra are further to be limited to
those with a chemical shift between 149 and 151 ppm. Entry of the partial formula F
shows there to be 114 entries for fluorine-containing compounds. Of these, only 11 are of
compounds with between 10 and 20 carbons, and only one of these, ID number 3826, has a
signal in the range 149-151 ppm.
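The partial formula terms can be interpreted with a small parser, sketched below. The term syntax is inferred from the examples in the table (an optional leading asterisk for negation, an element symbol, and an optional number or range); the function names and the dictionary representation of a compound are assumptions, and the sketch does not reproduce the PF program itself.

```python
import re

def parse_term(term):
    """Parse a partial formula term into (negate, element, low, high)."""
    m = re.fullmatch(r"(\*?)([A-Z][a-z]?)(\d+)?(?:-(\d+))?", term)
    if m is None:
        raise ValueError(f"unrecognized partial formula term: {term!r}")
    negate, element, low, high = m.groups()
    if low is None:
        return bool(negate), element, 1, None  # bare element: one or more atoms
    low = int(low)
    high = int(high) if high else low          # single number: exact count
    return bool(negate), element, low, high

def matches(counts, term):
    """True if the element counts of a compound satisfy the term."""
    negate, element, low, high = parse_term(term)
    n = counts.get(element, 0)
    present = n >= low and (high is None or n <= high)
    return present != negate

# A hypothetical fluorinated compound with 12 carbons, as in the F / C10-20 search.
compound = {"C": 12, "H": 9, "F": 1}
print(all(matches(compound, t) for t in ("F", "C10-20")))  # True
```

Successive terms are combined by intersection, in the same way as successive shifts in the SHIFT search.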
A program is available to permit the comparison of a complete unknown spectrum with each
spectrum in the file. This program, which recognizes the absence as well as the presence of
peaks at specific frequencies, has as its goal the identification of the file spectra which most
resemble the unknown. It is based upon an algorithm first developed by Clerc (7) and the name
of this option of the CNMR search system is in fact CLERC. The user is asked to enter the
frequencies of the signals in the unknown spectrum. The SFORD multiplicity values should also
be entered if they are available. When all the shifts have been entered, a dummy value of 999
is entered and the user is then given the opportunity to correct any errors in the input data.
Once these data are pronounced to be correct the search begins. The best matches in the file
are found, their respective goodness of fit values are calculated, and the best ten fits are
reported to the user. A perfect match is given a fit value of 100. Values below about 85
indicate very poor matches and matches with a value below 75 are not even reported.
An example of the CLERC search is given in Fig. 4, in which five distinct chemical shifts are
entered. The best fit, a spectrum of 1-propene, 3-(1,1-dimethylethoxy)- with a goodness of fit
of 95.69, is followed by three closely related compounds, and then six others whose spectra
match less well to the input spectrum. The shifts which were entered had been rounded to the
nearest whole number, and this accounts for the fit value of 96.59 rather than 100. This
particular search option lends itself to batch processing and, if the need arises, it is a
simple matter to enter the input data and run the actual search later, when lower computational
costs can be obtained. This might prove to be a useful approach for those with many searches
to carry out.
In addition to the SPEC option for retrieval of CNMR data, there is a similar program which
will retrieve all the entries corresponding to a specific CAS registry number. This number may
be obtained from a variety of sources such as other components of the CIS, or the open
literature. The program REGN accepts the appropriate registry number with hyphens and leading
zeros omitted and reports the ID numbers of any spectra corresponding to that registry number.
There may be more than one spectrum for a given compound because the spectra may have been
measured under different experimental conditions. The actual spectra may then be retrieved
using the ID numbers with the SPEC option.
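A registry number taken from the literature is usually printed with hyphens, so it must be reduced to the form that REGN accepts. The sketch below shows one way of doing so, together with the standard CAS check-digit test, which is a general property of registry numbers and is not claimed to be part of the REGN program. The function names and the small lookup table are assumptions; only the pairing of ID 300 with registry number 98862 is taken from the text.

```python
def normalize_regn(text):
    """Strip hyphens and leading zeros, e.g. '98-86-2' becomes '98862'."""
    digits = text.replace("-", "").strip()
    if not digits.isdigit():
        raise ValueError(f"not a registry number: {text!r}")
    return digits.lstrip("0")

def has_valid_check_digit(regn):
    """CAS rule: the last digit equals the position-weighted sum of the
    preceding digits (weights 1, 2, 3, ... counted from the right), modulo 10."""
    body, check = regn[:-1], int(regn[-1])
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check

# Hypothetical index from registry number to file ID numbers.
ids_by_regn = {"98862": [300]}

regn = normalize_regn("98-86-2")
print(regn, has_valid_check_digit(regn), ids_by_regn.get(regn, []))
# 98862 True [300]
```

The ID numbers returned in this way are the values that would then be passed to the SPEC option.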
This option of the CNMR search system is linked directly with the Substructure Search
System (5) of the CIS. Once a particular structure or substructure has been identified in the
CNMR data base by the Substructure Search System, the command CNMR invokes the
REGN option and lists all the spectra associated with that substructure and its related
registry number or numbers.
This is one of the more advanced features of the CIS, permitting direct identification of the chemical shifts associated with carbons in specific chemical environments in a molecule. An example of this process is shown in Fig. 5, in which an iodophenyl substructure is created by means of the commands RING, ALTBD, ABRAN and SATOM. All 11 occurrences of this fragment are found in the CNMR data base with the search option FPROB, and the SSHOW command permits listing of the 11 registry numbers. Finally,
the CNMR command leads the user to the relevant spectra, the first of these being that of p-fluoroiodobenzene, registry number 352341.
The remaining options of the CNMR search system are utilities rather than search or
retrieval programs. A newsletter is maintained on the system and can be accessed by the
command NEWS. This is used to alert users as to changes in the data base or in the
computer network and also to announce the availability of new programs and so on. The
command OPT simply lists by name each option in the system along with a very brief
description. The command HELP provides brief operating instructions for the MF option
and then, at the user's discretion, will do the same for the PF, SHIFT or CLERC programs.
The CRAB option allows users to report errors or problems to the system managers, and
finally, the command OUT allows the user to leave the CNMR search system and return to
the computer monitor.
FILE STRUCTURE
The experience gained in designing and building other components of the CIS has made it
clear that the key to an efficient, rapid and inexpensive search system for a large file such
as the CNMR spectral data base is a well-designed file structure. For the most part,
the file structure and system design used here employ the techniques first developed and
used in the Mass Spectral Search System (8). Since the file structure has been described in
some detail in the case of the MSSS, the reader is referred to that publication for this
information. The only difference between the MSSS and the CNMR systems is that the
latter uses octal notation rather than a decimal system as is used in MSSS. This is only a
trivial change, resulting in a slight file size reduction.
Since the structure of the inverted files is highly dependent on the bit length of the
computer word used, no description of the details of the CNMR files will be given here. As
can be seen from a comparison of the search examples given in this paper with the
corresponding MSSS searches (1) the file structure changes, if any, are transparent to the
user; the dialogue appears identical. For example, while all the programs continue to be
written in FORTRAN, the free form format of the DEC PDP-10 system allows for the
same input style for masses in MSSS (integers) or chemical shifts in CNMR (floating
point). The advantages of using the same proven approaches to the development of the CIS
systems are clear.
Acknowledgements
One of us (SRH) wishes to thank the EPA, Office of Planning and Management,
Management Information and Data Systems Division, (W. Greenstreet and M. Yaguda) as
well as the Office of International Activities (D. Gregory) for their support and assistance
in the initiation of this project.
References
1. S. R. Heller, H. M. Fales and G. W. A. Milne, Org. Mass Spectrom. 7, 107 (1973); S. R.
Heller, D. A. Koniver, H. M. Fales and G. W. A. Milne, Anal. Chem. 46, 947 (1974); S. R.
Heller, R. J. Feldmann, H. M. Fales and G. W. A. Milne, J. Chem. Doc. 13, 130 (1973); R.
S. Heller, G. W. A. Milne, R. J. Feldmann and S. R. Heller, J. Chem. Inf. Comput. Sci. 16,
176 (1976).
2. In addition to laboratories with which the authors are affiliated, these include BASF
(Ludwigshafen), Deutsches Krebsforschungszentrum (Heidelberg), Bruker Physik
(Karlsruhe), Central Institute for Chemistry (Budapest), ETH (Zurich), University of Paris,
University of Utrecht (Netherlands) and Miyagi and Sendai Universities (Japan).
3. W. Bremser, unpublished work (1975).
4. S. R. Heller, G. W. A. Milne and R. J. Feldmann, Science 195, 253 (1977).
5. R. J. Feldmann, G. W. A. Milne, S. R. Heller, A. Fein, J. A. Miller and B. Koch, J. Chem.
Inf. Comput. Sci. 17, 157 (1977).
6. For further information, please contact Dr Charles Citroen, NIC, Schoemakerstraat 97,
P.O. Box 36, 2600 AM Delft, The Netherlands.
7. R. Schwarzenbach, J. Meili, H. Koenitzer and J. T. Clerc, Org. Magn. Reson. 8, 11
(1976).
8. S. R. Heller, Anal. Chem. 44, 1951 (1972).
Received 18 November 1977; accepted (revised) 16 January 1978