A SUBSTRUCTURE ORIENTED 13C-NMR CHEMICAL SHIFT RETRIEVAL SYSTEM

J. ZUPAN and S. R. HELLER*

Environmental Protection Agency, Washington, D.C., 20460 (U.S.A.)

G. W. A. MILNE

National Institutes of Health, Bethesda, Md., 20014 (U.S.A.)

J. A. MILLER

Fein-Marquart Associates Inc., Towson, Md., 21212 (U.S.A.)

(Received 16th January 1978)

SUMMARY

A computer program that uses on-line generated substructures of organic compounds as input and retrieves the corresponding distributions of 13C-n.m.r. chemical shifts is described and discussed. The procedure of creating the substructures and the main features of the retrieval philosophy are outlined. One search is worked out in detail to demonstrate the ability of the system.

The assignment of 13C-n.m.r. spectra, i.e. the identification of the chemical shifts with the appropriate carbon atoms in the chemical environment of the molecule, is a primary application of 13C-n.m.r. spectroscopy. Usually this is done by applying empirical rules that describe the influence of neighbors on the central atom in the fragment in question, and by inspecting the spectrum to see if it fits the proposed correlation. Several tables of shift assignments have been published [1-3] since 13C-n.m.r. spectroscopy became recognized as a powerful tool for the elucidation of structures of organic compounds. Normally, such tables give only the upper and lower limits of the range in which the atom under consideration is expected to give a chemical shift. Such a description is far from complete because it provides no information as to whether the distribution of chemical shifts is uniform or centered in the given interval as a normal distribution.

It transpires that the distribution of the shifts in such an interval is not normal and is very dependent on the nature of neighbors more distant than the first and second neighbors, which are usually the only ones that are considered in the manual assembly of tables of this sort.

This paper describes a computer program which provides an easier and deeper insight into the distribution of chemical shifts, sampled from a fairly large data collection, for any specific structural environment.

DATA BASE AND IMPLEMENTATION

The NIH-EPA-NIC 13C-n.m.r. collection [4] was used as the data base in this work. In addition to the 13C-n.m.r. data, this data base contains the Chemical Abstracts Service (CAS) Registry number, the chemical name and the molecular formula of each compound. Each compound is also associated with a picture of its structure in which the atoms have been numbered. A typical display of an entry from this data base, obtained with the standard retrieval system [5, 6], is shown in Fig. 1.

The assignments of the chemical shifts were entered manually, using the same numbering as in these structures, with a specially written on-line program. Currently, the data base contains 4,024 spectra, some 2,500 of which have been assigned. Further work with the program described here to add the missing assignments in the data base is in progress.

In order to permit the user to define a chemical fragment and conduct a substructure search for it, an additional file is necessary. This file contains the connection tables of all the compounds in the 13C-n.m.r. data base. The building of substructures (fragments) can be done by using the substructure program developed by Feldmann et al. [7]. The commands most frequently used in structure generation are listed in Table 1.

From the point of view of the assignment of 13C-n.m.r. spectra, a very important option within the Substructure Search System is the TERMA command, which allows substituents to be defined precisely. The command TERMA 3,1, for example, sets to one the number of neighbors of the atom numbered "3" in the query structure. The atom "3", whatever it is, must therefore be the last one in the chain. Without the command "TERMA 3,1", structures containing atom "3" bound to 2, 3 or even more atoms would also be retrieved. Thus, for the accurate definition of larger structures, the TERMA command should be used for each atom. There are, of course, many other commands for substructure generation [8], but these are less frequently used in the present application.
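The effect of a TERMA constraint can be illustrated with a minimal sketch. It assumes a connection table held as a mapping from atom number to the list of bonded atom numbers, and a mapping from query atoms to atoms of a candidate structure; the function names and the data layout are hypothetical and are not taken from the Substructure Search System itself.

```python
# Hypothetical sketch of a TERMA-style check: the names and the
# dictionary layout are assumptions made for illustration only.

def neighbor_count(connection_table, atom):
    """Number of atoms bonded to `atom` in the connection table."""
    return len(connection_table[atom])

def passes_terma(connection_table, mapping, query_atom, n_neighbors):
    """True if the structure atom matched to `query_atom` has exactly
    `n_neighbors` neighbors, as a command like TERMA 3,1 would require."""
    target_atom = mapping[query_atom]
    return neighbor_count(connection_table, target_atom) == n_neighbors

# A four-atom chain 1-2-3-4 (hydrogens omitted).
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}

# The three-atom query 1-2-3 can be laid onto the chain in two ways.
mapping_end = {1: 2, 2: 3, 3: 4}  # query atom 3 falls on the chain end
mapping_mid = {1: 1, 2: 2, 3: 3}  # query atom 3 falls inside the chain

print(passes_terma(chain, mapping_end, 3, 1))
print(passes_terma(chain, mapping_mid, 3, 1))
```

Only the first mapping survives the constraint, because in the second one the atom matched to query atom "3" carries a further neighbor.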

The link between the Substructure Search System and 13C-n.m.r. files was provided by a fast double-hashing algorithm with twin prime numbers NP and NP-2 [9] using the CAS Registry number (NUMRG) as input to obtain the proper key address, KEY, of the connection table or chemical shifts:

      NP = 4723
      KEY = MOD(NUMRG, NP) + 1
      INC = MOD(NUMRG, NP-2) + 2
    1 CONTINUE
      KEY = KEY - INC
      IF (KEY .LE. 0) KEY = KEY + NP
      If the requested item has not been found on the address KEY GO TO 1
    2 CONTINUE

Fig. 1. A typical display of the full information for the requested ID number.

TABLE 1

There are 21 twin prime numbers between 4000 and 5000 and the choice of 4723 was made because it is expected that about 500 new spectra will soon be added to the current data base of 4,024 spectra. About 5% free space in the address table will still be available and so unsuccessful searches will be terminated relatively rapidly.
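The probing scheme above can be sketched in Python as follows. The constants NP and NP-2 and the two MOD expressions follow the text; the table layout (a list with addresses 1 to NP, each slot holding a registry number and a payload) and the function names are assumptions made for illustration.

```python
NP = 4723  # table size; 4723 and 4721 are twin primes

def probe_sequence(numrg, size=NP):
    """Yield addresses 1..size in the order the loop in the text visits them."""
    key = numrg % size + 1          # KEY = MOD(NUMRG, NP) + 1
    inc = numrg % (size - 2) + 2    # INC = MOD(NUMRG, NP-2) + 2
    for _ in range(size):
        key -= inc                  # KEY = KEY - INC
        if key <= 0:
            key += size             # wrap around into 1..size
        yield key

def make_table(size=NP):
    """Empty address table; index 0 is unused so addresses run 1..size."""
    return [None] * (size + 1)

def store(table, numrg, value):
    """Place (numrg, value) in the first free slot of the probe sequence."""
    for key in probe_sequence(numrg, len(table) - 1):
        if table[key] is None or table[key][0] == numrg:
            table[key] = (numrg, value)
            return key
    raise RuntimeError("address table is full")

def lookup(table, numrg):
    """Return the stored value, or None once an empty slot is reached."""
    for key in probe_sequence(numrg, len(table) - 1):
        entry = table[key]
        if entry is None:
            return None             # an empty slot ends an unsuccessful search
        if entry[0] == numrg:
            return entry[1]
    return None

def free_fraction(n_entries, size=NP):
    """Fraction of the address table that remains unoccupied."""
    return 1 - n_entries / size

table = make_table()
store(table, 100, "first")          # illustrative registry numbers only
store(table, 100 + NP, "second")    # same initial KEY, different INC
print(lookup(table, 100), lookup(table, 100 + NP), lookup(table, 64175))
print(round(free_fraction(4024 + 500), 3))
```

Because NP is prime and INC lies between 2 and NP-1, every address is visited exactly once before the sequence repeats. The two example numbers share the same initial KEY but receive different increments, so the secondary hash separates them on the first probe.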

The flow chart of the full process is given in Fig. 2. The complete search is done in three steps. First, the substructure fragment has to be built up on-line, using commands such as those shown in Table 1. Secondly, the search through the connection tables is performed in order to obtain all compounds containing the described substructure together with the numbering of the atoms in the structure as it appears in the 13C-n.m.r. file. Thirdly, the chemical shifts of the retrieved compounds are inspected and those produced by the various atoms of the query structure are statistically interpreted and reported.
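The third step can be sketched as follows, assuming that the substructure search returns, for each retrieved compound, a mapping from query atom numbers to the atom numbers used in the 13C-n.m.r. file. The data layout, the function names and the shift values are illustrative assumptions, not the program's actual files.

```python
from statistics import mean, stdev

def collect_shifts(hits, assignments, query_atom):
    """Gather the assigned shifts of `query_atom` over all retrieved
    compounds; compounds without an assignment are listed separately."""
    shifts, unassigned = [], []
    for compound_id, mapping in hits:
        atom_number = mapping[query_atom]
        shift = assignments.get(compound_id, {}).get(atom_number)
        if shift is None:
            unassigned.append(compound_id)
        else:
            shifts.append(shift)
    return shifts, unassigned

def summarize(shifts):
    """Simple descriptors of a list of shifts (ppm)."""
    return {
        "n": len(shifts),
        "min": min(shifts),
        "max": max(shifts),
        "mean": mean(shifts),
        "sd": stdev(shifts) if len(shifts) > 1 else 0.0,
    }

# Illustrative data: compound identifier, then query atom -> file atom.
hits = [("A", {1: 3}), ("B", {1: 1}), ("C", {1: 2})]
# Assigned shifts per compound (file atom -> ppm); "C" is unassigned.
assignments = {"A": {3: 178.0}, "B": {1: 172.0}}

shifts, unassigned = collect_shifts(hits, assignments, query_atom=1)
print(summarize(shifts), unassigned)
```

Keeping the unassigned compounds in a separate list mirrors the use of the system for planning further assignments.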

Because CAS defines in the connection tables nine different types of bonds (chain-single, chain-double, chain-triple, chain-tautomer, ring-single, ring-double, ring-triple, ring-tautomer, and ring-alternating), and there is a great variety of construction commands, it is possible to build up any type of structural fragment. When the searching is complete, the output routines permit the inspection of the shift of any atom that was in the query structure. The retrieved shifts and histograms of their distributions can also be printed, as can the list of unassigned compounds containing the same structural fragment. These can be obtained on request if further assignments are planned. An example of fragment construction and the corresponding output possibilities is shown in Table 2.

Although the collection of 13C-n.m.r. data is relatively small, the amount of data to be searched is large, because the average connection table consists of a 10 x 3 matrix and, on average, 7 chemical shifts/assignments per compound have to be inspected. Great care has to be taken to optimize the algorithms as well as the input/output operations, which use random access files. The program is written completely in DEC-10 FORTRAN and is incorporated in the 13C-n.m.r. Search System [4, 5] developed for the NIH-EPA Chemical Information System [6], which runs on DEC-10 computers. The on-line handling of data enables the users to obtain the results very quickly. The commands in the Substructure Search System are largely self-explanatory and the system has many on-line HELP messages. As a result, the learning period for most new users is very short, and the program can be used efficiently after very few trials.

RESULTS AND DISCUSSION

The first part of Table 2 shows the way in which query structures may be built. A structure can be added on the right of the Table for clarity: it appears in the actual on-line session only if requested by the option "D", for "display". The abbreviations TC and CS stand for "Tautomer Chain" and "Chain Single", respectively. The correct mnemonics for bond types can be retrieved by typing "H" after the SBOND command. Both options FPROB (fragment probe) and SUBSS 1 (substructure search on file 1) are necessary to obtain the desired results from the files. In principle, the command SUBSS could be issued alone, without a previous FPROB, but this would be very time-consuming because SUBSS conducts a bond-by-bond and atom-by-atom comparison for each structure. In practice, the System will not entertain a SUBSS command on the whole file; SUBSS can only be used on a temporary file such as those generated by searches like FPROB. Computer time is saved by prior use of the FPROB command. This causes the program to search for atom-centered fragments and forms a temporary (and smaller) file of candidates for SUBSS, whose work is thus reduced by at least one order of magnitude. After the substructure search has been done, the compounds found are stored in a permanent file that can be used after the user exits the substructure search.
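The division of labor between a fast screen and a full atom-by-atom comparison can be sketched as follows. This is not the FPROB or SUBSS implementation: the screen and the backtracking match below are generic stand-ins, bond types and hydrogens are ignored for brevity, and all names and example structures are hypothetical.

```python
from collections import Counter

def environment(atoms, bonds, a):
    """Element of atom `a` and a count of its neighbors' elements."""
    return atoms[a], Counter(atoms[n] for n in bonds[a])

def screen(query, structure):
    """Cheap test: every query atom must find some structure atom of the
    same element whose neighbors include at least the query's neighbors."""
    q_atoms, q_bonds = query
    s_atoms, s_bonds = structure
    s_envs = [environment(s_atoms, s_bonds, a) for a in s_atoms]
    for qa in q_atoms:
        q_el, q_nb = environment(q_atoms, q_bonds, qa)
        if not any(el == q_el and not (q_nb - nb) for el, nb in s_envs):
            return False
    return True

def match(query, structure):
    """Backtracking atom-by-atom match; returns a mapping from query atoms
    to structure atoms, or None if the fragment is not present."""
    q_atoms, q_bonds = query
    s_atoms, s_bonds = structure
    order = list(q_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        qa = order[len(mapping)]
        for sa in s_atoms:
            if sa in mapping.values() or s_atoms[sa] != q_atoms[qa]:
                continue
            # every bond to an already placed query atom must exist
            if all(mapping[qn] in s_bonds[sa]
                   for qn in q_bonds[qa] if qn in mapping):
                mapping[qa] = sa
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[qa]
        return None

    return extend({})

def search(query, file_):
    """Screen the whole file first, then match only the candidates."""
    candidates = {c: s for c, s in file_.items() if screen(query, s)}
    hits = {}
    for c, s in candidates.items():
        mapping = match(query, s)
        if mapping is not None:
            hits[c] = mapping
    return candidates, hits

# Query fragment O-C-C-O; each structure is (elements, adjacency).
query = ({1: "O", 2: "C", 3: "C", 4: "O"},
         {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}})
file_ = {
    "ethylene glycol": ({1: "O", 2: "C", 3: "C", 4: "O"},
                        {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}),
    "1,3-propanediol": ({1: "O", 2: "C", 3: "C", 4: "C", 5: "O"},
                        {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}),
    "propane": ({1: "C", 2: "C", 3: "C"}, {1: {2}, 2: {1, 3}, 3: {2}}),
}
candidates, hits = search(query, file_)
print(sorted(candidates), sorted(hits))
```

In this example the screen discards propane at once, while 1,3-propanediol passes the screen and is rejected only by the full comparison; the expensive step is therefore run on a smaller candidate set.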

The next step begins when the CNMR search program is called and the SUB (substructure) option is chosen. The program asks the number of the atom [...] sampling interval was used. The corresponding histograms of the three extended fragments containing the same central part are shown. The contribution to the shift distribution of the specific subgroups in the first histogram becomes instantly apparent. The influence of the double bond between the first and second neighbors is responsible for the shifts in the region below 175 ppm, while the saturated carboxylic acids have shifts that, on average, are about 10 ppm further downfield. It is also clear from these results that unless the distribution is really "Gaussian-like", the standard deviations and upper or lower limits of the intervals are rather poor descriptors of the shifts. The difference between the number of compounds in the first histogram and the sum of those in the other three arises because all carboxylic acids, including those containing non-carbon atoms as the second neighbors, are considered in the first example, while in the other searches, only carbon is permitted as a substituent on the non-carboxyl carbon. It is also noteworthy that the substructure search has the command INCLA n, which makes it possible to define the alternative atoms to be considered as neighbors at the same position. This command is very useful for further investigation of problems such as that in the example presented.
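Why the mean, standard deviation and range describe such a distribution poorly can be shown with a small sketch. The shift values below are invented for illustration and are not taken from the data base; the 2 ppm sampling interval and the function names are likewise assumptions.

```python
from collections import Counter
from statistics import mean, stdev

def histogram(shifts, width=2.0):
    """Count shifts per bin; keys are the lower bin edges in ppm."""
    counts = Counter(width * (s // width) for s in shifts)
    return dict(sorted(counts.items()))

def render(hist, width=2.0):
    """Plain-text histogram, one line per occupied bin."""
    return "\n".join(
        f"{lo:6.1f}-{lo + width:6.1f} ppm | {'*' * n}"
        for lo, n in hist.items()
    )

# Invented carboxyl-carbon shifts (ppm) for two sub-populations.
unsaturated = [168.5, 169.2, 170.1, 171.0, 171.8]
saturated = [178.9, 179.5, 180.2, 180.8, 181.4]
all_shifts = unsaturated + saturated

hist = histogram(all_shifts)
print(render(hist))
print("mean", round(mean(all_shifts), 2), "sd", round(stdev(all_shifts), 2))
print("range", min(all_shifts), "to", max(all_shifts))
```

With these values the mean falls in the empty region between the two clusters, so neither the mean nor the limits of the interval indicate where a shift is actually likely to be observed, whereas the histogram shows both groups directly.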

The most serious shortcoming of this system is related to the number of compounds in the file. At this stage, with about 4,000 spectra in hand, it is rather unrealistic to expect good results when more than second-order neighbors are included, although in some cases, such as the second example in Fig. 3, this might still be valuable. In any event, as the data base increases in size this shortcoming will become less apparent.

One of us (J. Z.) acknowledges the partial support of this work by the Research Community of Slovenia.

REFERENCES

1 F. W. Wehrli and T. Wirthlin, Interpretation of Carbon-13 NMR Spectra, Heyden, London, 1976.

2 E. Pretsch, J. T. Clerc, J. Seibl and W. Simon, Tabellen zur Strukturaufklärung organischer Verbindungen, Springer-Verlag, Berlin, New York, 1976, pp. B5-B10.

3 J. B. Stothers, Carbon-13 NMR Spectroscopy, Academic Press, New York, 1972.

4 CNMR Data Base, NIC, Delft, Holland (Attn: C. Citroen, NIC, CID-NTO, PO Box 36, 2600 AA, Delft, The Netherlands).

5 D. L. Dalrymple, C. L. Wilkins, G. W. A. Milne and S. R. Heller, Org. Mag. Res., 11 (1978) 000.

6 S. R. Heller, G. W. A. Milne and R. J. Feldmann, Science, 195 (1977) 253.

7 R. J. Feldmann, G. W. A. Milne, S. R. Heller, A. Fein, J. A. Miller and B. Koch, J. Chem. Inf. Comp. Sci., 17 (1977) 157.

8 J. A. Miller, Substructure Search System, Users Manual, Fein-Marquart Associates, Inc., 7215 York Road, Baltimore, Md., 21212.

9 D. E. Knuth, The Art of Computer Programming, 2nd edn., Addison-Wesley, Reading, 1973, Vol. III, p. 125.