J. Zupan*, S. R. Heller and G. W. A. Milne

Environmental Protection Agency, Washington, D.C. 20460

National Institutes of Health, Bethesda, MD 20014

An on-line, substructure-oriented retrieval system for CNMR spectra is described and discussed. Three-dimensional diagrams can be obtained showing distributions of chemical shifts as a function of substructure.


Identification of compounds from their CNMR spectra involves searching through large data bases of such spectra. The efficiency and speed of the search depend strongly on the organization of the data base. To make this task as convenient as possible for the chemist, a CNMR data retrieval program (1-4) has been implemented with a subroutine that permits inspection of the distributions of signals in the spectra of compounds with specified structural features.

A second aspect of this work was to provide a deeper insight into the correlation between the structural fragments and the corresponding C-13 chemical shifts. This knowledge is very desirable for the reconstruction of the spectrum of the compound. It can be used not only by analytical laboratories but also by sophisticated retrieval systems (3-6).


The NIH-EPA-NIC 13C NMR data base (7), containing 4024 CNMR spectra, was used. Along with the chemical name, molecular formula, and chemical shifts, each compound in this data base has an associated list of fragment codes (CIDS keys (8)) assigned automatically from the Chemical Abstracts Service (CAS) connection tables (9, 10). A summary of the most important, i.e. most frequently used, CIDS keys is given in Table 1. Altogether 536 different CIDS keys were necessary to code the entire data base, but only 35% of these appear more than 10 times.

To obtain the fragment code-chemical shift correspondence, two separate files were created. Firstly, a sequential file was created containing 250 groups, each group consisting of all compounds having a chemical shift at 1, 2, 3, ... and 250 ppm, respectively. Chemical shifts were rounded off to the nearest integer ppm. Secondly, a random-access file with the CIDS keys for each compound, having 2 to 36 items per compound (6 on the average), was generated. The link between the two files was the CAS Registry Number. A fast hashing algorithm (11) used this number to locate the address where the appropriate CIDS keys were stored. CIDS keys were tightly packed (three in one 36-bit DEC 10 word), so no more than 13 words were necessary to store all the fragment codes for any compound.
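The two-file layout can be illustrated with a minimal Python sketch. It is not the original FORTRAN implementation. It assumes that each key is stored as an integer id of 12 bits (36 bits divided by three keys per word), that id 0 marks an unused slot, and that a Python dictionary stands in for the hashed random-access file. The function names, compound labels, key ids, and shift values are all hypothetical.

```python
import math

KEY_BITS = 12                      # assumption: 36-bit word / 3 keys per word
KEYS_PER_WORD = 3
KEY_MASK = (1 << KEY_BITS) - 1     # largest id that fits in one slot (4095)


def pack_keys(key_ids):
    """Pack integer key ids (1..4095) three to a word; 0 marks an empty slot."""
    words = []
    for start in range(0, len(key_ids), KEYS_PER_WORD):
        word = 0
        for slot, key_id in enumerate(key_ids[start:start + KEYS_PER_WORD]):
            if not 1 <= key_id <= KEY_MASK:
                raise ValueError(f"key id {key_id} does not fit in {KEY_BITS} bits")
            word |= key_id << (slot * KEY_BITS)
        words.append(word)
    return words


def unpack_keys(words):
    """Recover the key ids from packed words, skipping empty (zero) slots."""
    keys = []
    for word in words:
        for slot in range(KEYS_PER_WORD):
            key_id = (word >> (slot * KEY_BITS)) & KEY_MASK
            if key_id:
                keys.append(key_id)
    return keys


def build_shift_index(spectra):
    """Map each integer ppm value (1..250) to the compounds with a shift there."""
    index = {ppm: set() for ppm in range(1, 251)}
    for cas_rn, shifts in spectra.items():
        for shift in shifts:
            ppm = math.floor(shift + 0.5)      # round to the nearest integer ppm
            if 1 <= ppm <= 250:
                index[ppm].add(cas_rn)
    return index


# Illustrative data only: labels, shifts, and key ids are invented.
spectra = {"rn-a": [128.4, 21.3], "rn-b": [127.6]}
key_file = {"rn-a": pack_keys([17, 301, 42, 5]), "rn-b": pack_keys([17, 9])}

shift_index = build_shift_index(spectra)
for cas_rn in sorted(shift_index[128]):        # compounds with a shift at 128 ppm
    print(cas_rn, unpack_keys(key_file[cas_rn]))
```

In this sketch the compound identifier is the only link between the shift-ordered groups and the packed key lists, which mirrors the role of the CAS Registry Number described above.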

Table 1

Some of the most frequent CIDS keys. For the complete list and description see (8, 9).


Description                                 Code type   Code value

Acyclic or cyclic compounds (A-C)           1           # of rings, 0 for acyclics
Number of cyclic nuclei (NCN)               2           # of nuclei
Direct attachment to cyclic nuclei          3           one DACN value for each nucleus
Extra cyclic features, for chains only
    (EC1)                                   11          # of double bonds
    (EC2)                                   12          # of triple bonds
    (EC3)                                   13          # of >CH- or >C= types
    (EC4)                                   14          # of >C< types
Specific functional groups (FG)             21          271 different groups
                                                        (e.g. -O-N= FG156)
Hydrocarbon groups (HR)                     31-43       up to 76 items for each code type
                                                        (e.g. El-C=C-R HR6ER)
Nonspecific diatomic groups (ND)            51          66 different groups
                                                        (e.g. -Te- N-N- ND30)
Nonspecific monoatomic groups (NM)          61          11 different groups
                                                        (e.g. -Te- NM10)
Specific cyclic nuclei (SCN)                71          134 different groups
                                                        (e.g. phenanthrene SCN126)
Generic cyclic nuclei keys (GCN)            81-86       # of appearances


The program is an option within the large on-line NIH/EPA CIS-CNMR retrieval program system (3). The user is able to define the functional groups that are to be present in the compounds whose spectra are to be considered. There are three ways in which this may be done:

1. Compare two groups: one with the selected fragment or combination of fragments, the other with the remainder of the file.

2. Compare two groups as before, with the data base file shortened by the preselection of some structural features.

3. Compare up to 20 groups with different fragment requests at the same time.
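The three modes listed above can be sketched in Python as set operations on the key lists. This is an illustration only. It assumes that a "combination of fragments" means that every requested key must be present in a compound (a logical AND), and the function names, compound labels, and key names are hypothetical.

```python
MAX_GROUPS = 20          # limit on simultaneous groups, as stated in the text
MAX_KEYS_PER_GROUP = 20  # limit on keys per group, as stated in the text


def matches(compound_keys, requested_keys):
    """True if the compound carries every requested key (assumed AND logic)."""
    return set(requested_keys) <= set(compound_keys)


def two_group_split(key_file, requested_keys, preselect=None):
    """Modes 1 and 2: matching compounds versus the remainder.

    With preselect given (mode 2), compounds lacking the preselected
    keys are dropped before the split is made.
    """
    if len(requested_keys) > MAX_KEYS_PER_GROUP:
        raise ValueError("too many keys in one group")
    selected, remainder = set(), set()
    for cas_rn, keys in key_file.items():
        if preselect is not None and not matches(keys, preselect):
            continue
        if matches(keys, requested_keys):
            selected.add(cas_rn)
        else:
            remainder.add(cas_rn)
    return selected, remainder


def multi_group(key_file, requests):
    """Mode 3: one set of matching compounds per request."""
    if len(requests) > MAX_GROUPS:
        raise ValueError("too many groups")
    if any(len(request) > MAX_KEYS_PER_GROUP for request in requests):
        raise ValueError("too many keys in one group")
    return [
        {cas_rn for cas_rn, keys in key_file.items() if matches(keys, request)}
        for request in requests
    ]


# Illustrative key file: compound labels and key names are invented.
key_file = {
    "rn-a": {"K1", "K2"},
    "rn-b": {"K1"},
    "rn-c": {"K3"},
}

print(two_group_split(key_file, ["K1"]))                    # mode 1
print(two_group_split(key_file, ["K2"], preselect=["K1"]))  # mode 2
print(multi_group(key_file, [["K1"], ["K3"]]))              # mode 3
```

In mode 3 the groups need not be disjoint, since a compound that satisfies several requests is counted in each of them.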

In each case any group can contain up to 20 different CIDS keys. After the search is completed, the output is presented as a 3-dimensional picture of the spectra, composed on the basis of the statistical appearance of the chemical shifts in each ppm region according to the CIDS code match of the input data (Fig. 1). After the picture is displayed on the terminal monitor, the user may re-display it with a "zoom" type capability, as shown in Figs. 2-4, using very simple commands such as U(p) 10, L(eft) 25, I(n) 50, and so on. The user can thus penetrate into the box of spectra presented, inspect each part of the display, and compare spectra in a very efficient and accurate way. If only one or two groups of substructures are considered, the usual 2-D picture can be displayed and inspected (Fig. 5).
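A command interpreter of this kind can be sketched in a few lines of Python. The sketch assumes that only the first letter of a command is significant, that an O(ut) command exists as the counterpart of I(n), and that each command simply adds a signed number of steps to one of three view offsets. The axis names and sign conventions are hypothetical and are not taken from the original program.

```python
import re

# Assumed mapping from command letter to (axis, direction).
DIRECTIONS = {
    "U": ("vertical", +1),
    "D": ("vertical", -1),
    "R": ("horizontal", +1),
    "L": ("horizontal", -1),
    "I": ("depth", +1),
    "O": ("depth", -1),
}

# One letter, an optional spelled-out remainder such as "(eft)", then a step count.
COMMAND = re.compile(r"\s*([A-Za-z])[A-Za-z()]*\s+(\d+)\s*$")


def apply_command(view, text):
    """Return a new view with the offsets changed by one command."""
    match = COMMAND.match(text)
    if match is None:
        raise ValueError(f"cannot parse command: {text!r}")
    letter = match.group(1).upper()
    if letter not in DIRECTIONS:
        raise ValueError(f"unknown direction: {letter}")
    axis, sign = DIRECTIONS[letter]
    new_view = dict(view)                 # leave the caller's view untouched
    new_view[axis] += sign * int(match.group(2))
    return new_view


view = {"horizontal": 0, "vertical": 0, "depth": 0}
for command in ["U(p) 10", "L(eft) 25", "I(n) 50"]:
    view = apply_command(view, command)
print(view)
```

Returning a new dictionary for each command keeps earlier views available, so a display could be redrawn from any previous position.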

In spite of the comparatively small size of the file (about 4000 CNMR spectra), it is necessary to inspect altogether 29,000 chemical shifts representing 3500 different compounds described by about 24,000 CIDS descriptions, and the on-line response can be slow. A typical search comparing 4 groups with 2 CIDS keys each takes about 80-100 seconds in real time (depending on computer system load) to produce the display.
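The counting step behind each displayed "spectrum" can be sketched as follows. This Python illustration assumes that a compound is counted at most once per integer ppm position and that each distribution is scaled so that its most populated position reads 100%, which is how the ordinate of the diagrams is described. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def shift_distribution(spectra, group):
    """Count, for each integer ppm value, the compounds in the group with a shift there."""
    counts = Counter()
    for cas_rn in group:
        # A set, so that two peaks rounding to the same ppm count once per compound.
        positions = {math.floor(shift + 0.5) for shift in spectra[cas_rn]}
        counts.update(ppm for ppm in positions if 1 <= ppm <= 250)
    return counts


def relative_frequencies(counts):
    """Scale the counts so that the most populated ppm position reads 100%."""
    peak = max(counts.values(), default=0)
    if peak == 0:
        return {}
    return {ppm: 100.0 * n / peak for ppm, n in counts.items()}


# Illustrative spectra only: labels and shift values are invented.
spectra = {
    "rn-a": [128.4, 128.2, 21.3],
    "rn-b": [127.6],
    "rn-c": [21.0],
}

counts = shift_distribution(spectra, {"rn-a", "rn-b"})
print(relative_frequencies(counts))
```

Stacking one such distribution per requested group, side by side along a third axis, gives the kind of 3-dimensional picture described above.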

The program is written entirely in FORTRAN; however, the plot and I/O routines are specific to the DEC 10 computer.

There are three main features that make this system quite useful for analysts. Firstly, it is substructure oriented: the input data consist of the structural codes of the desired fragments. Secondly, by the preselection of some structural features (e.g. aromatic rings AND methyl group), a general description of the fragments containing both features (i.e. toluenes, xylenes, etc.) is achieved. Thirdly, having up to 20 different statistical descriptions on the screen at the same time offers a variety of possibilities for use of the system. It is clear that this type of program can only be used fully in conjunction with other options of the CNMR search system. The CNMR system offers many different kinds of searches: for single or multiple shifts, identity numbers, molecular formulas, and so on. The 3-D option described here can be used as a single spectrum search if a very precise description of the fragments is entered, but such an application is not of practical value and, in any event, leads only to retrieval of the spectrum without identification of the compound. In such a case the compound could be identified using the SHIFT option of the CNMR system, entering the retrieved chemical shift positions.

Besides the described abilities of the 3-D option, this program has an educational value, since it can illustrate for students the correlation between the principal fragments and the corresponding chemical shifts.

However, we are quite aware that the CNMR "spectra" based on purely statistical counting of shift appearances are rather insufficient if the assignments of the shifts are not taken into account. The use of the assignments requires, besides including these data in the master file, a fast connection or link in the program to deal with them. Work on this problem is under way and we hope to report on this matter in the near future.


One of us (J.Z.) acknowledges the partial support of this work by the Research Community of Slovenia.


(1) S. R. Heller, G. W. A. Milne, R. J. Feldmann, J. Chem. Inf. Comp. Sci., 16, 235 (1976).

(2) G. W. A. Milne, S. R. Heller, in Computer-Assisted Structure Elucidation, Ed. D. H. Smith, ACS Symposium Series #54, 1977, p. 26-45.

(3) S. R. Heller, G. W. A. Milne, R. J. Feldmann, Science, 195, 253 (1977).

(4) D. L. Dalrymple, C. L. Wilkins, G. W. A. Milne, S. R. Heller, submitted for publication in J. Org. Magn. Res.

(5) D. H. Smith, L. M. Masinter, N. S. Sridharan, in Computer Representation and Manipulation of Chemical Information, Eds. W. R. Wipke, S. R. Heller, R. J. Feldmann, E. Hyde, Wiley & Sons, New York, 1974, p. 287-315.

(6) P. R. Naegeli, J. T. Clerc, Anal. Chem., 45, 739A (1974).

(7) CNMR Data Base, NIC, Delft, Holland (Attn.: C. Citroen, NIC, CID-NTO, PO Box 36, 2600 AA, Delft, The Netherlands).

(8) Handbook of CIDS Chemical Search Keys, Fein-Marquart Associates, Inc., 7215 York Road, Baltimore, MD., 21212, November 1973.

(9) J. A. Miller, Substructure Search System, Fein-Marquart Associates, Inc., 7215 York Road, Baltimore, MD., 21212, January 1976.

(10) R. J. Feldmann, G. W. A. Milne, S. R. Heller, A. Fein, J. A. Miller, B. Koch, J. Chem. Inf. Comp. Sci., 17, 157 (1977).

(11) D. E. Knuth, The Art of Computer Programming, Sec. Ed., Addison Wesley, Reading, 1973, Vol. II, p. 78.


An on-line retrieval system for CNMR spectra is discussed. The system is based on the organization of the collection by substructures. The distribution of chemical shifts is displayed as a function of substructure in three-dimensional diagrams.


Fig. 1. The display produced at the beginning gives an overall impression of all the "spectra" obtained for the requested substructures. In this particular case only four groups were requested: acyclic, carbonyl in acyclic, oxygen, and benzene compounds.

Figs. 2-4. The user can "move" in each direction (in-out, left-right, and up-down) by up to 100 steps in order to inspect the distributions more closely. The commands used to obtain Figs. 2-4 were I(n) 5; R(ight) 50 and I(n) 40; and L(eft) 150, I(n) 10, and D(own) 20, respectively.

Fig. 5. A 2-dimensional picture can be obtained if the distributions of only one or two groups are required. The figures 1346 and 826 represent the total number of compounds considered in the search procedure that have the required set of CIDS keys. The ordinate in this and in all other diagrams represents the relative frequency of peak occurrences in the appropriate chemical shift region. In the case of the benzene fragment (diag. B) this means that 364 compounds have a chemical shift at 128 ppm (marked as 100%).

Received 10.1.1978

(*) On leave from Chemical Institute Boris Kidric, 61000 Ljubljana, Yugoslavia