A Carbon-13 Nuclear Magnetic Resonance Spectral Data Base and Search System

D. L. Dalrymple

Nicolet Technology Corporation, Mountain View, California 94041, USA

C. L. Wilkins

Department of Chemistry, University of Nebraska, Lincoln, Nebraska 68588, USA

G. W. A. Milne*

National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA

S. R. Heller*

Environmental Protection Agency, PM-218, Washington D.C., 20460, USA

* Authors to whom correspondence should be addressed

A data base containing approximately 4000 13C nuclear magnetic resonance spectra has been assembled. The spectra have been evaluated and all the corresponding compounds have been registered by the Chemical Abstracts Service (CAS). The data base is available to the international scientific community on magnetic tape or microfiche and is also the basis of a search system operating upon an international computer network.

BACKGROUND

As the considerable utility of 13C NMR (CNMR) spectroscopy in biochemical, biological and environmental research has emerged, it has become clear that a large, readily available file of high quality reference CNMR spectra would be of great value as an adjunct to such work. No data base of this sort was available, and the experience gathered from our work on the data base of the EPA-NIH Mass Spectral Search System (1) made it fairly straightforward to define the criteria necessary for the generation of a CNMR data base.

Nomenclature must be handled according to an accepted standard, and for this reason it was decided to secure Chemical Abstracts Service (CAS) registry numbers and names for all compounds. Spectra that are to be used in the data base must be evaluated one by one by professional spectroscopists. The evaluation consists of the following steps. The name, molecular weight, molecular formula and structure of the compound are checked to ensure that they agree with one another. The spectrum is then reviewed to ascertain either that the number of lines is consistent with the structure or that any inconsistencies can be rationalized. All assignments provided are inspected, and those that seem clearly in error are either confirmed at the original source or deleted. Off-resonance decoupling data are checked in the same way. Finally, a number of decisions must be made, as described below, as to which spectra to retain and which to discard.

The cost of such a project, while considerable, is justified in terms of the promise it gives of more rapid and accurate identification of unknown organic compounds by members of the research community. In addition, it is expected that, like the Mass Spectral Search System (MSSS), this CNMR data base and search system will, as development decreases and use increases, become largely self-supporting. Thus, in the spring of 1975, EPA and NIH initiated a joint CNMR spectral data project. A contract calling for the coordination and evaluation of the data base was awarded to the University of Nebraska and programs for interactive searching through the data base were written by one of us (DLD).

At about this time, a joint consortium was organized to do much the same work in Europe and, as a result of a joining of forces, the project quickly grew into an international collaborative effort. Currently, data for the CNMR file are being collected by scientists in the United States, the Netherlands, Germany, Switzerland, France, Hungary and Japan. Managerial responsibility for the entire data base and search system has been assumed by the Netherlands Information Combine (NIC), a part of the Royal Dutch Chemical Society. Copies of the data base are available from the NIC and the complete search system is accessible by local telephone call via an international computer network. At present, there are approximately 4000 spectra in the data base, representing some 3900 compounds. Several compounds are represented more than once in the data base, since their spectra may have been run under different experimental conditions in which parameters such as temperature or solvent were changed.

The major source of all the spectra is the open literature. Several collections of spectra have been incorporated into the data base. For example, about 1400 spectra were obtained from a file built by Dalrymple at the University of Delaware, 400 were derived from the files of Clerc at ETH, Zurich and 900 were derived from data measured by Roberts's group at the California Institute of Technology and collected by Dorman of Eli Lilly Inc. Perhaps 10% of all file entries are of unpublished spectra. These are submitted directly to the University of Nebraska by the person who measures them.

A backlog of about 9000 spectra is in hand and should be added to the file during the next year. The extent of overlap between the backlog and the current file is not yet known.

DATA BASE

The spectra used in the formation of the CNMR data base have been obtained from a variety of sources, such as those described in the preceding section. A total of thirteen laboratories in seven countries (2) are now involved in the collection of 'raw' data from their own files and also from the open literature. These spectra are then pooled, in Europe by W. Bremser of BASF in Ludwigshafen, Germany, and in the US by C. L. Wilkins. Data are exchanged between BASF and the University of Nebraska using an exchange format designed by Bremser (3). This format, which is used only for the exchange of data on magnetic tape, contains the data elements that are used in the search system or the data base, and is available upon request from SRH or GWAM.

Once a spectrum is obtained by the University of Nebraska, the compound name and molecular formula are submitted to CAS. There the compound is identified and its CAS registry number and CAS collective index name are returned to Nebraska, where they are merged into the growing file. The inclusion of the registry number is crucial because it permits linking between entries in the CNMR data base and corresponding entries in other files of the NIH-EPA Chemical Information System (CIS) (4). In particular, it allows one to use the Substructure Search components of the CIS to search the CNMR data base for specific structures or substructures.

When a new entry has been passed through CAS, the existing data base is checked for its registry number and if it is absent the spectrum is evaluated and added to the data base. If the registry number is already in the file, then a check for points of difference between the old and new spectra is made. If there are significant differences, e.g. different solvents, more complete assignments and so on, the new spectrum is added to the data base, but if the new entry is substantially identical to the existing one it is not used. As will be seen below, this has a bearing upon the cost of leasing the data base.
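The registry-number check described above can be sketched in a few lines of Python. This is an illustration only and not the project's own software: the `Spectrum` record, the `differs_significantly` test and the solvent names are hypothetical, and the expert evaluation of each spectrum is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Spectrum:
    registry_number: str                  # CAS registry number, hyphens omitted
    solvent: str
    shifts: tuple                         # chemical shifts in ppm
    assignments: dict = field(default_factory=dict)  # shift -> atom label


def differs_significantly(old, new):
    """Hypothetical criterion: another solvent, or more complete assignments."""
    return old.solvent != new.solvent or len(new.assignments) > len(old.assignments)


def merge_entry(database, new):
    """Add `new` to `database` (registry number -> list of spectra).

    Returns True if the spectrum was added, False if it was judged to be
    substantially identical to an entry already on file.
    """
    existing = database.setdefault(new.registry_number, [])
    if not existing or all(differs_significantly(old, new) for old in existing):
        existing.append(new)
        return True
    return False


# Illustrative data, not taken from the actual file.
db = {}
first = merge_entry(db, Spectrum("98862", "CDCl3", (197.5, 137.0, 26.0)))
repeat = merge_entry(db, Spectrum("98862", "CDCl3", (197.5, 137.0, 26.0)))
other_solvent = merge_entry(db, Spectrum("98862", "DMSO", (197.5, 137.0, 26.0)))
```

With these inputs the first and third spectra are kept and the second is rejected as a duplicate, so the compound ends up represented twice in the file.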

To date, 15 000 spectra have been obtained and are being examined in this way. About half of these were collected in the US and Japan and the remainder were collected in Europe. It has been agreed by all collaborating groups that the master merged file, which contains spectra from all contributors, should be made available to the public in as many forms as are scientifically and practically feasible. The methods of dissemination which are now being used include the following. First, a magnetic tape of the full data base is available from the NIC on an annual lease basis (6). For an individual organization the cost of an annual lease is US$250. A desirable goal for all concerned is to enlarge the data base and it has therefore been decided that up to 50% of the annual lease fee can be 'paid' with new spectra. A credit of $5 is given for each new spectrum judged to be acceptable for inclusion into the data base. Second, since there is still a considerable value associated with data in the form of 'hard copy,' microfiche of the data base are being produced by our German collaborators. These are expected to be available at a nominal cost. Microfiche are inexpensive and easy to generate by computer, and can be discarded as updates of the file become available. Third, printed compilations of CNMR spectra may be published and offered for sale. Finally, an interactive search system based upon the CNMR data is available in Europe and North America on a fee-for-service basis via an international computer network. This system, which is described in detail below, has been available for over a year and is now being used by some twenty laboratories. At present an annual subscription fee of $100 is charged for the use of this system.
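The lease credit quoted above amounts to a small piece of arithmetic, sketched here in Python. The function name and parameter names are illustrative; the figures (US$250 annual fee, $5 per accepted spectrum, credit limited to 50% of the fee) are those given in the text.

```python
def net_lease_fee(accepted_spectra, annual_fee=250.0,
                  credit_per_spectrum=5.0, max_credit_fraction=0.5):
    """Cash payable for one year's lease after crediting contributed spectra."""
    credit = min(accepted_spectra * credit_per_spectrum,
                 annual_fee * max_credit_fraction)
    return annual_fee - credit


fee_10 = net_lease_fee(10)   # 10 spectra earn a $50 credit
fee_25 = net_lease_fee(25)   # 25 spectra reach the $125 ceiling
fee_40 = net_lease_fee(40)   # further spectra earn no additional credit
```

On these terms a contributor reaches the maximum credit with 25 accepted spectra, which halves the cash cost of the lease.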

SEARCH SYSTEM

The CNMR search system is a part of the NIH-EPA Chemical Information System (4) and is very similar to the Mass Spectral Search System (1) which is another component of the larger system. Searching through the CNMR data base can be accomplished using the options shown in Table 1. Each of the options can be used at a fixed price. These transaction prices, which are essentially cpu charges, are also given in Table 1.

Much of the software used in the CNMR search system was originally designed for use in the MSSS. This has been beneficial in that such programs have been extensively tested and debugged and also that users in many cases are familiar with the style of the dialog. Usage to date of the CNMR search system is comparable to the early usage of MSSS, even though the size of the CNMR data base is only about 20% that of the corresponding MSSS data base.

Table 1. Options of the CNMR Search System

Option   Purpose                                    Cost ($)
SHIFT    To search by chemical shift                1.00
MF       To search by full molecular formula        1.00
PF       To search by partial molecular formula     1.00
SPEC     To list and identify a spectrum            1.00
REGN     To identify a spectrum                     0.25
CLERC    To identify a complete spectrum            2.00
HELP     To obtain an explanation of an option      0.25
NEWS     To list the current newsletter             0.25
PRICE    To obtain a schedule of prices             0.25
COM      To enter a comment or complaint            1.00
EXIT     To exit from the program                   ---


The most useful of the search options is the SHIFT search. This program accepts a chemical shift, expressed in ppm from TMS, whose signal is arbitrarily defined as occurring at 0 ppm. The program can also accept, but does not require, a permissible deviation from this value and the multiplicity of the single frequency off-resonance decoupled signal, the SFORD. If no deviation is entered, the program assigns a window of width 1.0 ppm about the entered frequency. The SFORD multiplicity (S = singlet, D = doublet, T = triplet, Q = quartet) is a measure of the number of protons attached to a carbon and may or may not be known to the investigator. If a multiplicity value is entered, it will be used in the search, but if this information is not available the search will be conducted without it.

The program now searches through the inverted files of the data base for spectra containing the data as specified by the user, and reports back that a certain number of spectra match the criteria as specified. The user is then given a choice of listing these entries, ending the search or entering another chemical shift. If the list command is issued, the compound name and file ID number of each hit is listed. After every ten entries are listed, the user is asked if the listing should be continued. If a second shift is entered, the search is repeated using this second value and the new list of drops is combined in a Boolean AND operation with the existing list. The user is informed how many spectra contain both shifts and is again given the choice between listing these, ending the search, or entering another shift.

An example of the SHIFT search is given in Fig. 1. Here the user enters a chemical shift of 197.5 ppm with a deviation of 0.5 ppm and an SFORD multiplicity value of S. This will be matched by any spectrum in the data base containing a signal between 197.0 and 198.0 ppm, which, when subjected to off-resonance decoupling, appears as a singlet. There are 37 such spectra, too many to inspect conveniently, and so a second shift, 26.0 ppm, is entered. The number of spectra matching both these shifts is reduced to 3, and a third shift, 137.0 ppm, reduces this list to a single hit, ID# 300, Ethanone, 1-phenyl- (acetophenone), CAS registry number 98862.

Figure 1. The SHIFT search option of the CNMR search system.



The SHIFT search is thus automatically convergent, and the rate of convergence depends upon how characteristic the entered shifts are. Any entered shift which reduces the number of hits to zero is rejected and an appropriate message is returned to the user, who may then re-enter the shift with a different deviation and/or SFORD value. Alternatively, a different shift may be entered, or the search may be terminated.
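The convergent behaviour of the SHIFT search can be illustrated with a minimal Python sketch. It is not the system's own code, which is written in FORTRAN and works on inverted files; here a simple linear scan over a small dictionary stands in for those files. The function names are hypothetical, the default deviation of 0.5 ppm is inferred from the 1.0 ppm default window, and the toy spectra (other than the three acetophenone shifts quoted in the text) are invented for illustration.

```python
DEFAULT_DEVIATION = 0.5  # half of the 1.0 ppm default window


def match_shift(spectra, shift, deviation=DEFAULT_DEVIATION, multiplicity=None):
    """Return the IDs of spectra with a line within `deviation` ppm of `shift`.

    `spectra` maps an ID number to a list of (ppm, SFORD multiplicity) pairs.
    If `multiplicity` is None the multiplicity is not used in the search.
    """
    hits = set()
    for spec_id, lines in spectra.items():
        for ppm, mult in lines:
            if abs(ppm - shift) <= deviation and multiplicity in (None, mult):
                hits.add(spec_id)
                break
    return hits


def shift_search(spectra, queries):
    """AND successive shift queries; a query that leaves no hits is rejected."""
    current = None
    for query in queries:
        hits = match_shift(spectra, **query)
        combined = hits if current is None else current & hits
        if not combined:
            print(f"Shift {query['shift']} rejected: no spectra match")
            continue  # the previous hit list is kept unchanged
        current = combined
        print(f"{len(current)} spectra match")
    return current or set()


# Toy file: ID 300 carries the three shifts quoted in the text; IDs 1 and 2 are invented.
toy_file = {
    300: [(197.5, "S"), (137.0, "S"), (26.0, "Q")],
    1: [(197.8, "S"), (26.3, "Q"), (45.0, "T")],
    2: [(197.2, "D"), (137.1, "S")],
}
result = shift_search(toy_file, [
    {"shift": 197.5, "deviation": 0.5, "multiplicity": "S"},
    {"shift": 26.0},
    {"shift": 80.0},   # matches nothing, so it is rejected
    {"shift": 137.0},
])
```

Each accepted shift can only shrink the hit list, which is why the search converges; the rejected query leaves the list as it was, mirroring the behaviour described above.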

One of the results of the SHIFT search, or of other searches such as the molecular formula search, is the file ID number of any spectrum which matches the input data. This number can be used in the SPEC option, as shown in Fig. 2, to retrieve all the information pertaining to the file entry in question. Thus a logical sequel to the SHIFT search of Fig. 1 would be the retrieval shown in Fig. 2. Here, the ID number of 300 is entered and the program prints a numbered structural diagram of the compound in question, the name, registry number, molecular weight and molecular formula, the source reference and the solvent in which the spectrum was measured. This is followed by a listing of the chemical shifts and, when available, their SFORD multiplicities, intensities and assignments according to the numbering system used in the structural diagram.

A generally useful method of searching through chemical data is by means of complete or partial molecular formulae. The MF option of the CNMR search system prompts the user for the molecular formula in question and retrieves the spectra of all compounds with that formula. The molecular formula is entered in a standard fashion; atom-subscript pairs must be entered as carbon first, hydrogen second and then in alphabetical order. When the search is complete, the number of hits is reported to the user who can terminate the search or list the ID numbers which have been found, together with the corresponding names and registry numbers. If the list is lengthy, it is halted after every ten file entries and the user is asked whether or not it should be continued. If more information about any particular entry is desired, it must be sought using the SPEC option.
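The ordering rule for molecular formulae (carbon first, hydrogen second, then the remaining elements alphabetically) can be made concrete with a short Python sketch. The helper names are hypothetical, and whether the search system expects a subscript of 1 to be written out is not stated in the text; the sketch omits it.

```python
import re


def parse_formula(text):
    """Split a formula such as 'C8H8O' into {'C': 8, 'H': 8, 'O': 1}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", text):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def standard_order(counts):
    """Write the formula with carbon first, hydrogen second, the rest alphabetically."""
    rest = sorted(e for e in counts if e not in ("C", "H"))
    order = [e for e in ("C", "H") if e in counts] + rest
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in order)


acetophenone = standard_order(parse_formula("OC8H8"))
fluoroiodobenzene = standard_order(parse_formula("IFC6H4"))
```

Normalizing a formula in this way before the look-up means that the same compound is always entered, and stored, under one spelling of its formula.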

Table 2. Partial molecular formula (PF) searches

Search Type                  Example   Retrieved
(Element)                    F         All fluorine-containing compounds
(Element) (Number)           N3        All compounds with 3 nitrogens
(Element) (Range)            C3-7      All compounds with 3-7 carbon atoms
(*) (Element)                *S        All compounds not containing sulphur
(*) (Element) (Number)       *Br4      All compounds with 1-3 or 5 or more bromines
(*) (Element) (Range)        *Br1-3    All compounds with zero or more than three bromines






The partial molecular formula search, PF, can be used to find the spectra of compounds with defined partial formulae. As can be seen from Table 2, a search may be conducted for all compounds containing or not containing a specific element, or for those containing or not containing specific numbers or ranges of numbers of specific elements. After each PF search is completed, the number of hits is reported and the user is given the option of listing the responding entries, terminating the search, entering further partial formula details, or beginning a search based upon chemical shifts. In this last case, the SHIFT search proceeds as described above, but any list of spectra having specified chemical shifts is intersected with the list of spectra having the previously specified partial molecular formulae before being presented to the user. An example of the PF/SHIFT option is given in Fig. 3, where the spectra are sought of compounds containing any number of fluorine atoms and between 10 and 20 carbon atoms. These spectra are further to be limited to those with a chemical shift between 149 and 151 ppm. Entry of the partial formula F shows there to be 114 entries for fluorine-containing compounds. Of these, only 11 are of compounds with between 10 and 20 carbons, and only one of these, ID number 3826, has a signal in the range 149-151 ppm.
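A partial formula search of this kind can be sketched in Python as a set of predicates over element counts. The sketch is illustrative and not the system's implementation: the term syntax is taken from the examples F, N3, C3-7 and the starred forms in Table 2, the function names are hypothetical, and a leading asterisk is treated as a plain logical NOT, which may differ in detail from the system's handling of a starred single number. The toy formulae and ID numbers are invented.

```python
import re

# Optional '*', an element symbol, then an optional number or number range.
TERM = re.compile(r"^(\*?)([A-Z][a-z]?)(?:(\d+)(?:-(\d+))?)?$")


def parse_term(term):
    """Turn one PF term into a predicate over an element-count dictionary."""
    match = TERM.match(term)
    if match is None:
        raise ValueError(f"unrecognized partial formula term: {term!r}")
    negate, element, low, high = match.groups()

    def predicate(counts):
        n = counts.get(element, 0)
        if low is None:
            satisfied = n >= 1                      # element merely present
        else:
            upper = int(high) if high else int(low)
            satisfied = int(low) <= n <= upper      # exact number or range
        return not satisfied if negate else satisfied

    return predicate


def pf_search(formulas, terms):
    """Return the IDs whose element counts satisfy every term (Boolean AND)."""
    predicates = [parse_term(t) for t in terms]
    return {i for i, counts in formulas.items() if all(p(counts) for p in predicates)}


# Invented entries: ID number -> element counts.
toy_formulas = {
    1: {"C": 6, "H": 4, "F": 1, "I": 1},
    2: {"C": 12, "H": 9, "F": 1},
    3: {"C": 8, "H": 8, "O": 1},
}
fluorinated = pf_search(toy_formulas, ["F"])
fluorinated_c10_20 = pf_search(toy_formulas, ["F", "C10-20"])
```

Because the result is an ordinary set of ID numbers, it can be intersected with the hit list of a chemical shift search using the `&` operator, which corresponds to the combined PF/SHIFT search described above.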

A program is available to permit the comparison of a complete unknown spectrum with each spectrum in the file. This program, which recognizes the absence as well as the presence of peaks at specific frequencies, has as its goal the identification of the file spectra which most resemble the unknown. It is based upon an algorithm first developed by Clerc (7), and the name of this option of the CNMR search system is in fact CLERC. The user is asked to enter the frequencies of the signals in the unknown spectrum. The SFORD multiplicity values should also be entered if they are available. When all the shifts have been entered, a dummy value of 999 is entered and the user is then given the opportunity to correct any errors in the input data. Once these data are pronounced to be correct the search begins. The best matches in the file are found, their respective goodness of fit values are calculated, and the best ten fits are reported to the user. A perfect match is given a fit value of 100. Values below about 85 indicate very poor matches and matches with a value below 75 are not even reported. An example of the CLERC search is given in Fig. 4, in which five distinct chemical shifts are entered. The best fit, a spectrum of 1-propene, 3-(1,1-dimethylethoxy)- with a goodness of fit of 95.69, is followed by three closely related compounds, and then six others whose spectra match less well to the input spectrum. The shifts which were entered had been rounded to the nearest whole number, and this accounts for the fit value of 96.59 rather than 100. This particular search option lends itself to batch processing and, if the need arises, it is a simple matter to enter the input data and run the actual search later, when lower computational costs can be obtained. This might prove to be a useful approach for those with many searches to carry out.

In addition to the SPEC option for retrieval of CNMR data, there is a similar program which will retrieve all the entries corresponding to a specific CAS registry number. This number may be obtained from a variety of sources such as other components of the CIS, or the open literature. The program REGN accepts the appropriate registry number with hyphens and leading zeros omitted and reports the ID numbers of any spectra corresponding to that registry number. There may be more than one spectrum for a given compound because the spectra may have been measured under different experimental conditions. The actual spectra may then be retrieved using the ID numbers with the SPEC option.
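The input form expected by REGN (hyphens and leading zeros omitted) and the one-to-many look-up it performs can be sketched in Python as follows. The function names and the index contents are hypothetical; only ID 300 for registry number 98862 is taken from the text, and the second entry is invented to show a compound with two spectra.

```python
def normalize_registry_number(text):
    """Strip hyphens and leading zeros, e.g. '98-86-2' -> '98862'."""
    digits = text.strip().replace("-", "")
    if not digits.isdigit():
        raise ValueError(f"not a registry number: {text!r}")
    return digits.lstrip("0") or "0"


def regn(index, text):
    """Return the ID numbers of all spectra filed under a registry number.

    `index` maps a normalized registry number to a list of spectrum IDs;
    an unknown registry number yields an empty list.
    """
    return index.get(normalize_registry_number(text), [])


# Illustrative index: 98862 -> ID 300 is from the text, the rest is invented.
toy_index = {"98862": [300], "352341": [1201, 1202]}
acetophenone_ids = regn(toy_index, "98-86-2")
two_spectra = regn(toy_index, "0000352-34-1")
```

The ID numbers returned in this way are what the user would then pass to the SPEC option to retrieve the full entries.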

This option of the CNMR search system is linked directly with the Substructure Search System (5) of the CIS. Once a particular structure or substructure has been identified in the CNMR data base by the Substructure Search System, the command CNMR invokes the REGN option and lists all the spectra associated with that substructure and its related registry number or numbers.

This is one of the more advanced features of the CIS, permitting direct identification of the chemical shifts associated with carbons in specific chemical environments in a molecule. An example of this process is shown in Fig. 5, in which an iodophenyl substructure is created by means of the commands RING, ALTBD, ABRAN and SATOM. All 11 occurrences of this fragment are found in the CNMR data base with the search option FPROB, and the SSHOW command permits listing of the eleven registry numbers. Finally, the CNMR command leads the user to the relevant spectra, the first of these being that of p-fluoroiodobenzene, registry number 352341.

The remaining options of the CNMR search system are utilities rather than search or retrieval programs. A newsletter is maintained on the system and can be accessed by the command NEWS. This is used to alert users to changes in the data base or in the computer network and also to announce the availability of new programs and so on. The command OPT simply lists by name each option in the system along with a very brief description. The command HELP provides brief operating instructions for the MF option and then, at the user's discretion, will do the same for the PF, SHIFT or CLERC programs. The CRAB option allows users to report errors or problems to the system managers, and finally, the command OUT allows the user to leave the CNMR search system and return to the computer monitor.

FILE STRUCTURE

The experience gained in designing and building other components of the CIS has made it clear that the key to an efficient, rapid and inexpensive search system for a large file such as the CNMR spectral data base is a well-designed file structure. For the most part, the file structure and system design used here employ the techniques first developed and used in the Mass Spectral Search System (8). Since the file structure has been described in some detail in the case of the MSSS, the reader is referred to that publication for this information. The only difference between the MSSS and the CNMR systems is that the latter uses octal notation rather than the decimal system used in MSSS. This is only a trivial change, resulting in a slight file size reduction.

Since the structure of the inverted files is highly dependent on the bit length of the computer word used, no description of the details of the CNMR files will be given here. As can be seen from a comparison of the search examples given in this paper with the corresponding MSSS searches (1), the file structure changes, if any, are transparent to the user; the dialogue appears identical. For example, while all the programs continue to be written in FORTRAN, the free form format of the DEC PDP-10 system allows for the same input style for masses in MSSS (integers) or chemical shifts in CNMR (floating point). The advantages of using the same proven approaches in the development of the CIS systems are clear.

Acknowledgements

One of us (SRH) wishes to thank the EPA, Office of Planning and Management, Management Information and Data Systems Division, (W. Greenstreet and M. Yaguda) as well as the Office of International Activities (D. Gregory) for their support and assistance in the initiation of this project.



References

1. S. R. Heller, H. M. Fales and G. W. A. Milne, Org. Mass Spectrom. 7, 107 (1973); S. R. Heller, D. A. Koniver, H. M. Fales and G. W. A. Milne, Anal. Chem. 46, 947 (1974); S. R. Heller, R. J. Feldmann, H. M. Fales and G. W. A. Milne, J. Chem. Doc. 13, 130 (1973); R. S. Heller, G. W. A. Milne, R. J. Feldmann and S. R. Heller, J. Chem. Inf. Comput. Sci. 16, 176 (1976).



2. In addition to laboratories with which the authors are affiliated, these include BASF (Ludwigshafen), Deutsche Krebsforschungszentrum (Heidelberg), Bruker Physik (Karlsruhe), Central Institute for Chemistry (Budapest), ETH (Zurich), University of Paris, University of Utrecht (Netherlands) and Miyagi and Sendai Universities (Japan).

3. W. Bremser, unpublished work (1975).

4. S. R. Heller, G. W. A. Milne and R. J. Feldmann, Science 195, 253 (1977).

5. R. J. Feldmann, G. W. A. Milne, S. R. Heller, A. Fein, J. A. Miller and B. Koch, J. Chem. Inf. Comput. Sci. 17, 157 (1977).

6. For further information, please contact Dr Charles Citroen, NIC, Schoemakerstraat 97, P.O. Box 36, 2600 AM Delft, The Netherlands.

7. R. Schwarzenbach, J. Meili, H. Koenitzer and J. T. Clerc, Org. Magn. Reson. 8, 11 (1976).

8. S. R. Heller, Anal. Chem. 44, 1951 (1972).



Received 18 November 1977; accepted (revised) 16 January 1978