A SUBSTRUCTURE ORIENTED 13C-NMR CHEMICAL SHIFT RETRIEVAL SYSTEM
J. ZUPAN and S. R. HELLER*
Environmental Protection Agency, Washington, D.C., 20460 (U.S.A.)
G. W. A. MILNE
National Institutes of Health, Bethesda, Md., 20014 (U.S.A.)
J. A. MILLER
Fein-Marquart Associates Inc., Towson, Md., 21212 (U.S.A.)
(Received 16th January 1978)
SUMMARY
A computer program that uses on-line generated substructures of
organic compounds as input and retrieves the corresponding
distributions of 13C-n.m.r. chemical shifts is described and
discussed. The procedure of creating the substructures and the
main features of the retrieval philosophy are outlined. One
search is worked out in detail to demonstrate the ability of the
system.
The assigning of 13C-n.m.r. spectra, i.e. the identification of
the chemical shifts with the appropriate carbon atoms in the
chemical environment of the molecule, is a primary application of
13C-n.m.r. spectroscopy. Usually this is done by empirical rules
describing the influence of neighbors on the central atom in the
fragment in question, and by inspecting the spectrum to see if it
fits the proposed correlation. Several tables of shift
assignments have been published [1--3] since 13C-n.m.r.
spectroscopy became recognized as a powerful tool for the
elucidation of structures of organic compounds. Normally, such
tables give only the upper and lower limit of the range in which
the atom under consideration is expected to give a chemical
shift. Such a description is far from complete because it
provides no information as to whether the distribution of
chemical shifts is uniform or centered in the given interval as a
normal distribution.
It transpires that the distribution of the shifts in such an interval is not normal and is very dependent on the nature of neighbors more distant than the first and second neighbors, which are usually the only ones that are considered in the manual assembly of tables of this sort.
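The point about ranges versus distributions can be illustrated with a minimal sketch. The shift values below are invented for illustration and are not taken from the data base; the function names and the 2 ppm bin width are likewise assumptions. The sample has a single lower and upper limit, yet the histogram shows two separate clusters that the limits alone cannot reveal.

```python
from collections import Counter

def shift_histogram(shifts, bin_width=2.0):
    """Count shifts (ppm) per bin; each key is the lower edge of a bin."""
    counts = Counter()
    for s in shifts:
        lower = bin_width * (s // bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))

def print_histogram(shifts, bin_width=2.0):
    """Print one text row per occupied bin, one asterisk per shift."""
    for lower, n in shift_histogram(shifts, bin_width).items():
        print(f"{lower:6.1f}-{lower + bin_width:6.1f} ppm | {'*' * n}")

# Invented illustrative shifts (ppm), NOT values from the data base.
sample = [168.2, 169.5, 170.1, 171.0, 171.4,
          178.9, 179.6, 180.3, 181.1, 181.7]

print("range:", min(sample), "to", max(sample), "ppm")
print_histogram(sample)
```

A table entry for this sample would report only the interval from 168.2 to 181.7 ppm, although no shift falls between 172 and 178 ppm.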
This paper describes a computer program that provides easier
and deeper insight into the distribution of chemical shifts,
sampled from a fairly large data collection, for any specific
structural environment.
DATA BASE AND IMPLEMENTATION
The NIH-EPA-NIC 13C-n.m.r. collection [4] was used as the
data base in this work. In addition to the 13C-n.m.r. data,
this data base contains the Chemical Abstracts Service (CAS)
Registry number, the chemical name and the molecular formula
of each compound. Each compound also has an associated
picture of its structure in which the atoms have been
numbered. A typical display of an entry from this data
base, obtained with the standard retrieval system [5, 6],
is shown in Fig. 1.
The assignments of the chemical shifts were entered
manually, using the same numbering as in these structures,
with a specially written on-line program. Currently, the
data base contains 4,024 spectra, some 2,500 of which have
been assigned. Further work with the program described here
to add the missing assignments in the data base is in
progress.
In order to permit the user to define a chemical fragment
and conduct a substructure search for it, an additional
file is necessary. This file contains the connection tables
of all the compounds in the 13C-n.m.r. data base. The
building of substructures (fragments) can be done by using
the substructure program developed by Feldmann et al. [7].
The commands most frequently used in structure generation
are listed in Table 1.
From the point of view of the assignment of 13C-n.m.r. spectra, a very important option within the Substructure Search System is the TERMA command, which allows substituents to be defined precisely. The command TERMA 3,1, for example, sets the number of neighbors of atom "3" in the query structure to one. Atom "3", whatever it is, must therefore be the last one in the chain. Without the command "TERMA 3,1", structures in which atom "3" is bound to 2, 3 or even more atoms would also be retrieved. Thus, for the accurate definition of larger structures, the TERMA command should be used for each atom. There are, of course, many other commands for substructure generation [8], but these are less frequently used in the present application.
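The effect of a neighbor-count constraint of this kind can be sketched as follows. This is not the code of the Substructure Search System; the function name, the adjacency representation and the assumption that hydrogens are absent from the connection table are all illustrative choices.

```python
def neighbor_count_ok(adjacency, mapping, required_counts):
    """Check TERMA-like constraints on one candidate match.

    adjacency:       compound atom number -> set of bonded atom numbers
                     (hydrogens assumed absent, as an illustration)
    mapping:         query atom number -> compound atom number
    required_counts: query atom number -> exact number of neighbors
    """
    return all(len(adjacency[mapping[q]]) == n
               for q, n in required_counts.items())

# Hypothetical connection tables: a three-atom and a four-atom chain.
propane = {1: {2}, 2: {1, 3}, 3: {2}}
butane = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

# Query chain 1-2-3 mapped onto atoms 1-2-3 of each compound.
match = {1: 1, 2: 2, 3: 3}
terma = {3: 1}  # analogous to "TERMA 3,1"

print(neighbor_count_ok(propane, match, terma))  # atom 3 is terminal
print(neighbor_count_ok(butane, match, terma))   # atom 3 has two neighbors
```

Without the constraint (an empty `required_counts`), both mappings are accepted, which corresponds to the retrieval of structures in which atom "3" carries further substituents.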
The link between the Substructure Search System and 13C-n.m.r. files was provided by a fast double-hashing
algorithm with twin prime numbers NP and NP-2 [9] using the
CAS Registry number (NUMRG) as input to obtain the proper
key address, KEY, of the connection table or chemical
shifts:
      NP = 4723
      KEY = MOD(NUMRG, NP) + 1
      INC = MOD(NUMRG, NP-2) + 2
    1 CONTINUE
      KEY = KEY - INC
      IF (KEY .LE. 0) KEY = KEY + NP
C     .
C     .
C     .
C     If the requested item has not been found at the
C     address KEY, GO TO 1
    2 CONTINUE
Fig. 1. A typical display of the full information for the
requested ID number.
TABLE 1
There are 21 twin prime numbers between 4000 and 5000 and
the choice of 4723 was made because it is expected that
about 500 new spectra will soon be added to the current
data base of 4,024 spectra. About 5% free space in the
address table will still be available and so unsuccessful
searches will be terminated relatively rapidly.
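The probing scheme in the listing can be sketched in Python as shown below. This is a minimal illustration rather than the authors' implementation: the class and function names are invented, the table is held in memory instead of in a random access file, and the rule that an empty address ends an unsuccessful search is an assumption consistent with the remark on free space. The first address inspected follows the listing as printed, that is, after the first subtraction of INC.

```python
NP = 4723  # table size; 4721 and 4723 form a twin-prime pair

def probe_sequence(numrg, np_=NP):
    """Yield the addresses 1..np_ in the order in which they are probed."""
    key = numrg % np_ + 1          # KEY = MOD(NUMRG, NP) + 1
    inc = numrg % (np_ - 2) + 2    # INC = MOD(NUMRG, NP-2) + 2
    for _ in range(np_):
        key -= inc
        if key <= 0:
            key += np_
        yield key

class AddressTable:
    """In-memory stand-in for the address table (illustrative only)."""

    def __init__(self, np_=NP):
        self.np = np_
        self.slots = [None] * (np_ + 1)  # index 0 unused, addresses are 1-based

    def insert(self, numrg, record):
        for key in probe_sequence(numrg, self.np):
            if self.slots[key] is None or self.slots[key][0] == numrg:
                self.slots[key] = (numrg, record)
                return key
        raise RuntimeError("address table is full")

    def lookup(self, numrg):
        for key in probe_sequence(numrg, self.np):
            entry = self.slots[key]
            if entry is None:
                return None            # assumed: empty address ends the search
            if entry[0] == numrg:
                return entry[1]
        return None

table = AddressTable()
table.insert(64197, "acetic acid")  # CAS 64-19-7 written as an integer
table.insert(64197 + NP, "hypothetical record, same initial KEY")
print(table.lookup(64197), "|", table.lookup(71432))

# Free space once about 500 spectra are added to the 4,024 present.
free_fraction = 1 - (4024 + 500) / NP
print(f"free fraction: {free_fraction:.3f}")
```

Because NP is prime and INC lies between 2 and NP-1, the sequence visits every address exactly once before repeating, so an insertion fails only when the table is completely full.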
The flow chart of the full process is given in Fig. 2. The
complete search is done in three steps. First, the
substructure fragment has to be built up on-line, using
commands such as those shown in Table 1. Secondly, the
search through the connection tables is performed in order
to obtain all compounds containing the described
substructure together with the numbering of the atoms in
the structure as it appears in the 13C-n.m.r. file.
Thirdly, the chemical shifts of the retrieved compounds are
inspected and those produced by the various atoms of the
query structure are statistically interpreted and reported.
Because CAS defines in the connection tables nine different types of bonds (chain-single, chain-double, chain-triple, chain-tautomer, ring-single, ring-double, ring-triple, ring-tautomer, and ring-alternating), and there is a
great variety of construction commands, it is possible to
build up any type of structural fragment. When the
searching is complete, the output routines permit the
inspection of the shift of any atom that was in the query
structure. The retrieved shifts and histograms of their
distributions can also be printed, as can the list of
unassigned compounds containing the same structural
fragment. These can be obtained on request if further
assignments are planned. An example of fragment
construction and the corresponding output possibilities is
shown in Table 2.
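The third step, in which the shifts belonging to one atom of the query structure are collected and summarized, can be sketched as follows. The data layout, the function names and the choice of summary statistics are assumptions made for illustration; the numbers are invented and do not come from the data base.

```python
from statistics import mean, pstdev

def shifts_for_query_atom(hits, assignments, query_atom):
    """Collect the shifts produced by one query atom over all hits.

    hits:        list of (compound_id, {query atom -> compound atom number})
    assignments: compound_id -> {compound atom number -> shift in ppm};
                 unassigned compounds are simply absent from this dict
    Returns (list of shifts, list of unassigned compound ids).
    """
    shifts, unassigned = [], []
    for compound_id, atom_map in hits:
        table = assignments.get(compound_id)
        if table is None:
            unassigned.append(compound_id)
            continue
        atom = atom_map[query_atom]
        if atom in table:
            shifts.append(table[atom])
    return shifts, unassigned

def summarize(shifts):
    """Basic descriptors of a list of shifts (population standard deviation)."""
    return {"n": len(shifts), "min": min(shifts), "max": max(shifts),
            "mean": mean(shifts), "sd": pstdev(shifts)}

# Invented example: three hits, of which compound 3 has no assignments.
hits = [(1, {1: 2}), (2, {1: 1}), (3, {1: 4})]
assignments = {1: {2: 170.0}, 2: {1: 180.0}}

shifts, unassigned = shifts_for_query_atom(hits, assignments, query_atom=1)
print(summarize(shifts))
print("unassigned compounds:", unassigned)
```

The list of unassigned compounds is kept separately so that it can be reported on request when further assignments are planned.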
Although the collection of 13C-n.m.r. data is relatively
small, the amount of data to be searched is large, because
the average connection table consists of a 10 x 3 matrix
and on average, 7 chemical shifts/assignments per compound
have to be inspected. Great care has to be taken to
optimize the algorithms as well as the input/output
operations, which use random access files. The program is
written completely in DEC-10 FORTRAN and is incorporated in
the 13C-n.m.r. Search System [4, 5] developed for the NIH-EPA Chemical Information System [6], which runs on DEC-10
computers. The on-line handling of data enables the users
to obtain the results very quickly. The commands in the
Substructure Search System are largely self-explanatory and
the system has many on-line HELP messages. As a result, the
learning period for most new users is very short, and the
program can be used efficiently after very few trials.
RESULTS AND DISCUSSION
The first part of Table 2 shows the way in which query
structures may be built. A structure can be added on the
right of the Table for clarity: it appears in the actual
on-line session only if requested by the option "D", for
"display". The abbreviations TC and CS stand for
"Tautomer.Chain" and "Chain Single", respectively. The
correct mnemonics for bond types can be retrieved by typing
"H" after the SBOND command. Both options FPROB (fragment
probe) and SUBSS 1 (substructure search on file 1) are
necessary to obtain the desired results from the files. In
principle, the command SUBSS could be issued alone, without
a previous FPROB, but it is very time-consuming because it
conducts a bond-by-bond and atom-by-atom comparison for
each structure. In practice, the System will not entertain
a SUBSS command on the whole file; SUBSS can only be used
with respect to a temporary file such as those that are
generated by searches such as FPROB. Computer time is saved
by prior use of the FPROB command. This causes the program
to search for atom-centered fragments and forms a temporary
(and smaller) file of candidates for SUBSS, whose work is
thus reduced by at least one order of magnitude. After the
substructure search has been done, the compounds found are
stored in a permanent file that can be used after the user
exits the substructure search.
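The division of labor between a cheap screen and a full comparison can be sketched as follows. This is a simplified illustration and not the algorithm of the Substructure Search System: the structure representation, the bond labels "CS" and "CD", the function names and the backtracking matcher are all assumptions.

```python
from collections import Counter

def make_structure(elements, bond_list):
    """elements: atom -> symbol; bond_list: (atom, atom, bond type) triples."""
    bonds = {a: {} for a in elements}
    for a, b, kind in bond_list:
        bonds[a][b] = kind
        bonds[b][a] = kind
    return elements, bonds

def centered_fragment(elements, bonds, atom):
    """Central element plus a multiset of (bond type, neighbor element)."""
    return elements[atom], Counter((k, elements[n]) for n, k in bonds[atom].items())

def fragment_probe(query, target):
    """Cheap screen: every query atom needs some compatible target atom."""
    (q_el, q_b), (t_el, t_b) = query, target
    t_frags = [centered_fragment(t_el, t_b, a) for a in t_el]
    for a in q_el:
        el, nb = centered_fragment(q_el, q_b, a)
        if not any(el == t and all(tn[k] >= v for k, v in nb.items())
                   for t, tn in t_frags):
            return False
    return True

def substructure_match(query, target):
    """Atom-by-atom, bond-by-bond backtracking; returns a mapping or None."""
    (q_el, q_b), (t_el, t_b) = query, target
    order = list(q_el)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]
        for t in t_el:
            if t in mapping.values() or t_el[t] != q_el[q]:
                continue
            # Bonds to already-mapped query neighbors must exist with the same type.
            if all(t_b[t].get(mapping[n]) == k
                   for n, k in q_b[q].items() if n in mapping):
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]
        return None

    return extend({})

def search(query, compounds):
    """Screen first, then run the expensive match only on the candidates."""
    candidates = [c for c, s in compounds.items() if fragment_probe(query, s)]
    hits = {}
    for c in candidates:
        m = substructure_match(query, compounds[c])
        if m is not None:
            hits[c] = m
    return candidates, hits

# Hypothetical query: a carbon bearing one double-bonded and one single-bonded oxygen.
query = make_structure({1: "C", 2: "O", 3: "O"}, [(1, 2, "CD"), (1, 3, "CS")])
compounds = {
    "acetic acid": make_structure({1: "C", 2: "C", 3: "O", 4: "O"},
                                  [(1, 2, "CS"), (2, 3, "CD"), (2, 4, "CS")]),
    "ethanol": make_structure({1: "C", 2: "C", 3: "O"},
                              [(1, 2, "CS"), (2, 3, "CS")]),
    "acetone": make_structure({1: "C", 2: "C", 3: "O", 4: "C"},
                              [(1, 2, "CS"), (2, 3, "CD"), (2, 4, "CS")]),
}
candidates, hits = search(query, compounds)
print(candidates, hits)
```

The screen is only a necessary condition, so the full comparison is still run on every candidate; its benefit is that most compounds never reach that stage. The returned mapping from query atoms to compound atoms is what later allows the shift of each query atom to be looked up.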
The next step begins when the CNMR search program is called
and the SUB (substructure) option is chosen. The program asks
for the number of the atom [...] sampling interval was used. The
corresponding histograms of the three extended fragments
containing the same central part are shown. The contribution to the shift distribution of the specific
subgroups in the first histogram becomes instantly
apparent. The influence of the double bond between the
first and second neighbors is responsible for the shifts in
the region below 175 ppm, while the saturated carboxylic
acids have shifts that, on average, are about 10 ppm further
downfield. It is also clear from these results that unless
the distribution is really "Gaussian-like", the standard
deviations and upper or lower limits of the intervals are
rather poor descriptors of the shifts. The difference in
the number of compounds in the first histogram and the sum
of those in the other three arises because all carboxylic
acids, including those containing non-carbon atoms as the
second neighbors, are considered in the first example,
while in the other searches, only carbon is permitted as a
substituent on the non-carboxyl carbon. It is also noteworthy
that the Substructure Search System has the command INCLA n,
which makes it possible to define alternative atoms that are
allowed as neighbors at the same position in the structure.
This command is very useful for further investigation of
problems such as that in the example presented.
The most serious shortcoming of this system is related to
the number of compounds in the file. At this stage, with
about 4,000 spectra in hand, it is rather unrealistic to
expect good results when more than second-order neighbors
are included, although in some cases, such as the second
example in Fig. 3, this might still be valuable. In any
event, as the data base increases in size this shortcoming
will become less apparent.
One of us (J. Z.) acknowledges the partial support of this
work by the Research Community of Slovenia.
REFERENCES
1 F. W. Wehrli and T. Wirthlin, Interpretation of Carbon-13 NMR Spectra, Heyden,
London, 1976.
2 E. Pretsch, J. T. Clerc, J. Seibl, and W. Simon, Tabellen zur Strukturaufklärung
organischer Verbindungen, Springer-Verlag, Berlin, New York, 1976, pp. B5--B10.
3 J. B. Stothers, Carbon-13 NMR Spectroscopy, Academic Press, New York, 1972.
4 CNMR Data Base, NIC, Delft, Holland (Attn: C. Citroen, NIC, CID-NTO, PO Box 36,
2600 AA, Delft, The Netherlands).
5 D. L. Dalrymple, C. L. Wilkins, G. W. A. Milne and S. R.
Heller, Org. Magn. Reson., 11 (1978) 000.
6 S. R. Heller, G. W. A. Milne, and R. J. Feldmann,
Science, 195 (1977) 253.
7 R. J. Feldmann, G. W. A. Milne, S. R. Heller, A. Fein, J. A. Miller and B. Koch, J. Chem.
Inf. Comp. Sci., 17 (1977) 157.
8 J. A. Miller, Substructure Search System, Users Manual, Fein-Marquart Associates, Inc.,
7215 York Road, Baltimore, Md., 21212.
9 D. E. Knuth, The Art of Computer Programming, 2nd edn.,
Addison-Wesley, Reading, 1973, Vol. III, p. 125.