An on-line, substructure-oriented retrieval system for CNMR spectra is described and
discussed. Three-dimensional diagrams can be obtained showing distributions of chemical shifts
as a function of substructure.
Identification of compounds from their CNMR spectra involves searching through large
data bases of such spectra. The efficiency and speed of the searching process are highly dependent
upon the organization of the data base. To make this task as convenient as possible for
the chemist, a CNMR data retrieval program (1-4) has been provided with a subroutine
which permits inspection of the distributions of signals from spectra of compounds with
specific structural features.
A second aspect of this work was to provide a deeper insight into the correlation between
the structural fragments and the corresponding C-13 chemical shifts. This knowledge is very
desirable for the reconstruction of the spectrum of the compound. It can be used not only by
analytical laboratories but also by sophisticated retrieval systems (3-6).
The NIH-EPA-NIC 13C NMR data base (7), containing 4024 CNMR spectra, was used.
Along with the chemical name, molecular formula, and chemical shifts, each compound in
this data base has an associated list of fragment codes (CIDS keys (8)) assigned automatically
from the Chemical Abstracts Service (CAS) connection tables (9, 10). A summary of the most
important, i.e. most frequently used, CIDS keys is given in Table 1. Altogether 536 different CIDS
keys were necessary to code the entire data base, but only 35% of these appear more than 10 times.
To obtain the fragment code-chemical shift correspondence, two separate files were
created. Firstly, a sequential file was created containing 250 groups, each group consisting of all
compounds having a chemical shift at 1, 2, 3, ..., 250 ppm, respectively. Chemical shifts were
rounded to the nearest integer ppm. Secondly, a random-access file with the CIDS keys for
each compound, having 2 to 36 items per compound (6 on average), was generated. The link
between the two files was the CAS Registry Number. A fast hashing algorithm (11) used this
number to locate the address where the appropriate CIDS keys were stored. CIDS keys were
tightly packed (three in one 36-bit DEC 10 word), so no more than 13 words were necessary to
store all the fragment codes for any compound.
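The packing and lookup described above can be illustrated with a minimal Python sketch. It is not the authors' FORTRAN code. It assumes that each CIDS key is represented by an integer key number that fits in 12 bits (three keys per 36-bit word), that 0 marks an unused slot, and that a Python dictionary keyed by CAS Registry Number may stand in for the hashing algorithm of ref. (11). The function names and the key numbers in the example are hypothetical.

```python
KEY_BITS = 12                      # assumed: 36-bit word divided among 3 keys
KEYS_PER_WORD = 3
KEY_MASK = (1 << KEY_BITS) - 1     # largest key number that fits in one slot
EMPTY = 0                          # assumed filler for unused slots


def pack_keys(keys):
    """Pack integer key numbers (1..4095) into a list of 36-bit words."""
    for key in keys:
        if not 1 <= key <= KEY_MASK:
            raise ValueError(f"key {key} does not fit in {KEY_BITS} bits")
    words = []
    for i in range(0, len(keys), KEYS_PER_WORD):
        chunk = list(keys[i:i + KEYS_PER_WORD])
        chunk += [EMPTY] * (KEYS_PER_WORD - len(chunk))  # pad the last word
        word = 0
        for key in chunk:          # first key ends up in the highest bits
            word = (word << KEY_BITS) | key
        words.append(word)
    return words


def unpack_keys(words):
    """Recover the key numbers, in their original order, from packed words."""
    keys = []
    for word in words:
        for slot in range(KEYS_PER_WORD - 1, -1, -1):
            key = (word >> (slot * KEY_BITS)) & KEY_MASK
            if key != EMPTY:
                keys.append(key)
    return keys


# Stand-in for the hashed random-access file: CAS Registry Number -> words.
key_file = {}


def store(cas_rn, keys):
    key_file[cas_rn] = pack_keys(keys)


def lookup(cas_rn):
    return unpack_keys(key_file[cas_rn])


store("71-43-2", [5, 17, 301, 42])           # hypothetical key numbers
print(lookup("71-43-2"))                     # [5, 17, 301, 42]
print(len(pack_keys(list(range(1, 37)))))    # 36 keys -> 12 words
```

Under this layout the maximum of 36 keys occupies 12 words. The text gives an upper bound of 13 words; what the additional word holds is not stated there and is not modelled in this sketch.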
Table 1. Some of the most frequent CIDS keys. For the complete list and description see (8, 9).

Description                                  Code type   Code value
Acyclic or cyclic compounds (A-C)            1           # of rings, 0 for acyclics
Number of cyclic nuclei (NCN)                2           # of nuclei
Direct attachment to cyclic nuclei (DACN)    3           one DACN value for each nucleus
Extra cyclic features (EC1)                  11          # of double bonds
                      (EC2)                  12          # of triple bonds
                      (EC3)                  13          # of >CH- or >C= types
                      (EC4)                  14          # of >C< types
Specific functional groups (FG)              21          271 different groups
                                                         (e.g. -O-N= FG156)
Hydrocarbon groups (HR)                      31-43       up to 76 items for each code type
                                                         (e.g. El-C=C-R HR6ER)
Nonspecific diatomic groups (ND)             51          66 different groups
                                                         (e.g. -Te- N-N- ND30)
Nonspecific monoatomic groups (NM)           61          11 different groups
                                                         (e.g. -Te- NM10)
Specific cyclic nuclei (SCN)                 71          134 different groups
                                                         (e.g. phenanthrene SCN126)
Generic cyclic nuclei keys (GCN)             81-86       # of appearances
The program is an option within the large on-line NIH/EPA CIS-CNMR retrieval program
system (3). The user is able to define the functional groups that are to be present in the
compounds whose spectra are to be considered. There are three ways in which this may be done:
1. Compare two groups: one with the selected fragment or combination of fragments, the other with the
remainder of the file.
2. Compare two groups as before, with the data base file shortened by the preselection of some
structural features.
3. Compare up to 20 groups with different fragment requests at the same time.
In each case any group could contain up to 20 different CIDS keys. After the search is
completed the output is presented as a 3-dimensional picture of the spectra composed on the
basis of the statistical appearance of the chemical shifts in each ppm region according to the
CIDS code match of the input data (Fig. 1). After the picture is displayed on the terminal
monitor, the user may re-display it with a "zoom" type capability as shown in Figs. 2-4, using very
simple commands such as U(p) 10, L(eft) 25, I(n) 50, and so on. The user can thus penetrate
into the box of spectra presented, inspect each part of the display, and compare spectra in a
very efficient and accurate way. If only one or two groups of substructures are considered, the
usual 2-D picture can be displayed and inspected (Fig. 5).
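As a rough illustration of how one such statistical "spectrum" can be computed, the following Python sketch builds the shift-indexed file, selects a group of compounds by their keys, and scales the per-ppm counts so that the tallest bin is 100%, as on the ordinate of Fig. 5. It is a sketch under stated assumptions, not the authors' program: a group is taken to match only compounds holding all of its keys, and the compound identifiers, key names ("benzene", "methyl"), and shift values are invented toy data rather than real CIDS codes or data base entries.

```python
import math
from collections import defaultdict


def build_shift_file(spectra):
    """Map each integer ppm (1..250) to the set of compounds with a shift there."""
    shift_file = defaultdict(set)
    for compound, shifts in spectra.items():
        for shift in shifts:
            ppm = math.floor(shift + 0.5)      # round to the nearest integer ppm
            if 1 <= ppm <= 250:
                shift_file[ppm].add(compound)
    return shift_file


def select(keys_of, required):
    """Compounds holding every required key (assumed AND semantics)."""
    required = set(required)
    return {c for c, keys in keys_of.items() if required <= keys}


def distribution(shift_file, members):
    """Relative frequency per ppm, as a percentage of the tallest bin."""
    counts = {ppm: len(compounds & members) for ppm, compounds in shift_file.items()}
    counts = {ppm: n for ppm, n in counts.items() if n}
    peak = max(counts.values(), default=0)
    return {ppm: 100.0 * n / peak for ppm, n in sorted(counts.items())}


# Toy data: hypothetical compounds, keys, and shifts.
spectra = {"A": [128.4, 21.3], "B": [128.2, 125.2], "C": [14.1, 22.7]}
keys_of = {"A": {"benzene", "methyl"}, "B": {"benzene"}, "C": {"methyl"}}

shift_file = build_shift_file(spectra)
group = select(keys_of, {"benzene"})           # selected fragment
remainder = set(keys_of) - group               # the rest of the file

print(distribution(shift_file, group))         # {21: 50.0, 125: 50.0, 128: 100.0}
print(distribution(shift_file, remainder))     # {14: 100.0, 23: 100.0}
```

Comparing a group with the remainder of the file corresponds to the first input mode; the multi-group mode would call `distribution` once per requested group and stack the results along the third axis of the display.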
In spite of the comparatively small size of the file (about 4000 CNMR spectra) it is
necessary to inspect altogether 29,000 chemical shifts representing 3500 different compounds
described by about 24,000 CIDS descriptions, and the on-line response can be slow. A
typical search comparing 4 groups with 2 CIDS keys each takes about 80-100 seconds in real
time (depending on computer system load) to produce the display.
The program is written entirely in FORTRAN; however, the plot and I/O routines are
specific to the DEC 10 computer.
There are three main features that make this system quite useful for analysts. Firstly, it is
substructure oriented: the input data consist of the structural codes of the desired fragments.
Secondly, preselection of some structural features (e.g. aromatic rings AND methyl group)
gives a general description of fragments containing both features (i.e. toluenes, xylenes, etc.).
Thirdly, having up to 20 different statistical descriptions on the screen at the
same time offers a variety of possibilities for use of the system. It is clear that this type of
program can only be used fully in conjunction with other options of the CNMR search system.
The CNMR system offers many different kinds of searches: for single or multiple shifts, identity
numbers, molecular formulas, and so on. The 3-D option described here can be used as a single
spectrum search if a very precise description of the fragments is entered, but such an application
is not of practical value and, in any event, leads only to retrieval of the spectrum without
identification of the compound. In such a case the compound could be identified using the SHIFT
option of the CNMR system, entering the retrieved chemical shift positions.
Besides the described abilities of the 3-D option, this program has an educational value since
it can illustrate for students the correlation between the principal fragments and the
corresponding chemical shifts.
However, we are quite aware that CNMR "spectra" based purely on statistical
counting of shift appearances are rather insufficient if the assignments of the shifts are not taken
into account. The use of the assignments requires, besides including these data in the master
file, a fast connection or link in the program to deal with them. Work on this problem is under
way and we hope to report on this matter in the near future.
One of us (J.Z.) acknowledges the partial support of the Research Community of Slovenia for this work.
(1) S. R. Heller, G. W. A. Milne, R. J. Feldmann, J. Chem. Inf. Comp. Sci., 16, 235 (1976).
(2) G. W. A. Milne, S. R. Heller, in Computer-Assisted Structure Elucidation, Ed. D. H. Smith, ACS Symposium Series #54, 1977, p. 26-45.
(3) S. R. Heller, G. W. A. Milne, R. J. Feldmann, Science, 195, 253 (1977).
(4) D. L. Dalrymple, C. L. Wilkins, G. W. A. Milne, S. R. Heller, submitted for publication in J. Org. Magn. Res.
(5) D. H. Smith, L. M. Masinter, N. S. Sridharan, in Computer Representation and Manipulation of Chemical Information, Eds. W. R. Wipke, S. R. Heller, R. J. Feldmann, E. Hyde, Wiley & Sons, New York, 1974, p. 287-315.
(6) P. R. Naegeli, J. T. Clerc, Anal. Chem., 45, 739A (1974).
(7) CNMR Data Base, NIC, Delft, Holland (Attn.: C. Citroen, NIC, CID-NTO, PO Box 36, 2600 AA, Delft, The Netherlands).
(8) Handbook of CIDS Chemical Search Keys, Fein-Marquart Associates, Inc., 7215 York Road, Baltimore, MD 21212, November 1973.
(9) J. A. Miller, Substructure Search System, Fein-Marquart Associates, Inc., 7215 York Road, Baltimore, MD 21212, January 1976.
(10) R. J. Feldmann, G. W. A. Milne, S. R. Heller, A. Fein, J. A. Miller, B. Koch, J. Chem. Inf. Comp. Sci., 17, 157 (1977).
(11) D. E. Knuth, The Art of Computer Programming, Sec. Ed., Addison Wesley, Reading, 1973, Vol. II, p. 78.
An on-line retrieval system for CNMR spectra is discussed. The system is based on organizing
the data base by substructures. The distribution of chemical shifts is displayed as a function of substructure.
Fig. 1. The display produced at the beginning gives an overall impression of all "spectra"
obtained for the requested substructures. In this particular case only four groups were requested:
acyclic, carbonyl in an acyclic, oxygen, and benzene compounds.
Figs. 2-4. The user can "move" in each direction (in-out, left-right, and up-down) up to 100 steps in order to inspect the distributions more closely. The steps, or
commands, used to obtain Figs. 2-4 were I(n) 5; R(ight) 50 and I(n) 40; and L(eft) 150, I(n)
10, and D(own) 20, respectively.
Fig. 5. A 2-dimensional picture can be obtained if the distributions of only one or two groups are required. The figures 1346 and 826 represent the total number of compounds considered in the search procedure having the required set of CIDS keys. The ordinate in this and in all other diagrams represents the relative frequency of peak occurrences in the appropriate chemical shift region. In the case of the benzene fragment (diag. B) this means that 364 compounds have a chemical shift at 128 ppm (marked as 100%).
(*) On leave from Chemical Institute Boris Kidric, 61000 Ljubljana, Yugoslavia