Interpretation of Mass Spectrometry Data Using Cluster Analysis--Alkyl Thiolesters

Stephen R. Heller1 and Chin L. Chang

Heuristics Laboratory, National Institutes of Health, Bethesda, Md. 20014

Kenneth C. Chu

Computer Science Laboratory, Division of Computer Research and Technology, National Institutes of Health, Bethesda, Md. 20014

1Present address, Management Information and Data Systems Division, Environmental Protection Agency, Washington, D.C. 20460.

The interpretation of mass spectral data has followed three main courses: library searching (1), pattern recognition (2), and artificial intelligence (3). These techniques have as their goal the determination of the structure of the molecule from a given mass spectrum. We now wish to introduce the use of another technique, cluster analysis, as an aid for the interpretation of mass spectral data. Structure determination can be divided into several subgoals; in this presentation, cluster analysis is applied to one of them, namely functional group classification.

The various library searching techniques that determine structures are "non-intelligent" in the sense that they simply perform a task of comparing an unknown, by a variety of methods, to known compounds in a library file and indicating possible solutions. The value of such systems is dependent on the size of the library. Thus, as the library grows, so does the cost to store and process the data, as well as the real and elapsed time to obtain possible answers.

In an attempt to interpret mass spectra without resorting to a large library, two main techniques have been employed. The first, pattern recognition, involves using a data base or training set to devise a way to interpret spectra by "teaching" the computer which peaks and losses are "good" and "bad." The decision rules obtained from the training set are then used to "predict" the classifications of unknown spectra.

The second approach, used in the DENDRAL project, is to program empirically derived mass spec fragmentation rules. A list of possible structures is constructed and the spectra from these possibilities are generated; then the spectrum which is most similar to the unknown identifies it. This technique has been applied successfully to monofunctional acyclic amines and ethers (3).

Both these methods are based on the assumption that the classes to be studied are known. In cluster analysis, the usual approach is to allow the method to classify data into categories or clusters of its own making. Thus, it is sometimes referred to as "learning without a teacher" or unsupervised learning. In addition, there are cluster analysis procedures which are "supervised," but these have not been used here. Cluster analysis has the advantage of possibly finding new methods for understanding old or puzzling data.

The particular cluster analysis procedure used in this presentation is a graph-theoretical method called the shortest spanning path (SSP) (4). This procedure creates an ordered list of the sample points which reflects the minimum path through these sample points. Each sample point has 227 components (i.e., a 227-dimensional vector in space). Starting with an ordered list of the sample points, the algorithm iteratively reorders the list so that the resulting ordered list has the minimum sum of distances between adjacent sample points. This collection of minimum distances represents a short path through all the sample points, hence the name SSP. Thus, applying this procedure to the mass spectral data gives a linear ordering of the spectra which tends to cluster them.
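The reordering idea described above can be illustrated with a minimal sketch. It is not the SAIL program used in this work: the Euclidean distance and the segment-reversal (2-opt style) improvement step are assumptions chosen for illustration, and the function names are hypothetical. The sketch only shows how an ordered list can be rearranged until the sum of distances between adjacent sample points stops decreasing.

```python
import math
from itertools import combinations

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def path_length(order, points):
    """Sum of distances between adjacent sample points in the ordered list."""
    return sum(distance(points[order[i]], points[order[i + 1]])
               for i in range(len(order) - 1))

def short_spanning_path(points):
    """Start from the given order and reverse segments of the list
    whenever doing so shortens the open path; stop when no reversal helps.
    The result is a short path, not necessarily the shortest one."""
    order = list(range(len(points)))
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
            if path_length(candidate, points) < path_length(order, points) - 1e-12:
                order = candidate
                improved = True
    return order
```

Applied to three one-dimensional points with values 0, 2, and 1, the sketch moves the point with value 1 between the other two, so that neighbors in the list are also neighbors in feature space.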

EXPERIMENTAL

In the particular study undertaken, the data file consisted of 323 mass spectra of compounds containing only one sulfur atom and any other atoms in any amounts, taken from a master file of 8782 spectra. This subset file was generated using the imbedded molecular formula search routine of the DCRT/CIS Mass Spec Search System (5). The initial file consisted of 525 compounds. However, the file was reduced to 323 compounds by removing those spectra in the file which did not have peaks beginning at least at m/e = 26. In addition, duplicates were not removed. The experimental spectra consisted of the peaks from m/e = 14 to 140 and all losses from the parent ion to M - 99, which gave a total of 227 feature points. The choice of the features selected was arbitrary, and it may be necessary to modify the features used when attempts are made to cluster other compounds. By experimentation, it was found that replacing the actual intensities with (single) weighted intensities appeared to give better clusters. The choice of weights was quite arbitrary and might very well be expected to vary as other classes of compounds are studied. In this work on the sulfur compounds, peaks and losses in a spectrum with an intensity of 0.01-49% were given an intensity value of 1. Those peaks with an intensity of 50-100% were given an intensity value of 2. Those losses with an intensity of 50-100% were given an intensity value of 4. The programs for this work were written in FORTRAN and SAIL (an ALGOL type language), and all were run on a DEC PDP-10 computer. The clustering program for the 227 features of the 323 sulfur compounds required about 86K words of core and about 30 minutes of cpu time to run.
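The feature encoding described above can be sketched as follows. This is an illustration, not the original FORTRAN or SAIL code. Two points are assumptions: the losses are taken to run from M - 0 through M - 99 (which gives 127 + 100 = 227 features, matching the stated total), and a loss of k is taken to mean a peak at m/e = M - k. Intensities between 49% and 50% are not covered by the stated ranges and are treated here as weak. All names are hypothetical.

```python
PEAK_RANGE = range(14, 141)  # m/e 14..140 inclusive: 127 peak features
LOSS_RANGE = range(0, 100)   # losses M-0..M-99: 100 loss features (assumed range)

def weight(intensity, strong_value):
    """Map a percent intensity onto the coarse weights described in the text."""
    if intensity >= 50.0:
        return strong_value   # strong peak (2) or strong loss (4)
    if intensity >= 0.01:
        return 1              # weak but present
    return 0                  # absent

def encode_spectrum(peaks, parent_mass):
    """peaks: dict mapping m/e to percent intensity (0-100).
    parent_mass: m/e of the parent ion, M.
    Returns the 227-component weighted feature vector."""
    peak_features = [weight(peaks.get(mz, 0.0), 2) for mz in PEAK_RANGE]
    loss_features = [weight(peaks.get(parent_mass - k, 0.0), 4)
                     for k in LOSS_RANGE]
    return peak_features + loss_features
```

With a made-up spectrum such as `{43: 100.0, 47: 10.0, 90: 30.0}` and a parent mass of 90, the peak at m/e 43 is encoded as 2 among the peak features and as 4 among the loss features (as the loss M - 47).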

RESULTS

After the data, consisting of the 323 sulfur compounds, were formulated into a linear path by the SSP procedure, the data were divided into a number of linear segments by using the intuitive judgment of the chemists. (In later work it is hoped to automate this manual step.) Each of these linear segments, manually defined by a chemist, constitutes a cluster (class). From these segments, part of which is shown in Figure 1, one class has been tentatively defined as the alkyl thiol esters, and this class, which has been investigated in depth, will be discussed here. The master file of 8782 spectra contained 45 (actually 46, but one spectrum was found to be incorrect) monofunctional straight-chain alkyl thiol esters of the general formula:

MW   MF                    NAME
277  C9.H12.N.O5.P.S       SUMITHION
301  C10.H12.N3.O4.P.S     OXYGEN ANALOG OF GUTHION
286  C14.H10.N2.O3.S       4,6-DIPHENYL-1,2,3,5-OXATHIADIAZINE-2,2 DIOXIDE
273  C11.H12.N.O3.CL.S     CHLORMEZONE (TRANCOPAL)
333  C13.H16.N3.O4.NA.S    METHAMPRYONE (DIPYRONE POWDER ULMER)
295  C14.H18.CL.N3.S       CHLOROTHEN (TAGATHEN)
146  C8.H18.S              2,2,4,4-TETRAMETHYL-3-THIAPENTANE
146  C8.H18.S              TERT-BUTYL SULFIDE
146  C7.H14.O.S            ISO-BUTYL THIOL NOR-PROPANOATE
146  C7.H14.O.S            NOR-BUTYL THIOL NOR-PROPANOATE
146  C7.H14.O.S            ETHYL THIOL NOR-PROPANOATE
146  C7.H14.O.S            ETHYL THIOL ISO-PROPANOATE
118  C5.H10.O.S            ETHYL THIOL ACETATE
118  C5.H10.O.S            METHYL THIOL NOR-BUTYRATE
118  C5.H10.O.S            NOR-PROPYL THIOL ACETATE
118  C5.H10.O.S            ISO-PROPYL THIOL ACETATE
90   C3.H6.O.S             METHYL THIOL ACETATE
60   C.O.S                 CARBON OXYSULFIDE
64   O2.S                  SULFUR DIOXIDE
34   H2.S                  HYDROGEN SULFIDE
48   C.H4.S                METHANETHIOL (METHYL MERCAPTAN)
94   C2.H6.O2.S            DIMETHYLSULFONE
90   C2.H6.N2.S            BIS (METHYLIMINO) SULPHUR
78   C2.H6.O.S             2-MERCAPTO ETHANOL
78   C2.H6.O.S             SULFOXIDE DIMETHYL


Figure 1. Part of reordered list of sulfur compounds after the SSP procedure had been applied

R1-C(=O)-S-R2

R1 = C1-C10; R2 = C1-C8







All these spectra were run on a Bendix TOF mass spectrometer (6). There were no thiol esters with aromatic or saturated rings or other functional groups. Thus, the classification rules are to be considered applicable only to this limited class of compounds.

The matrix array of the spectral features arranged in the linear order found by the SSP program was manually inspected, and 29 features were picked out and found to characterize alkyl thiol esters. The features consist of peaks and losses that were found to be always present or always absent in the class. These 29 features were then processed against the entire 8782 spectra from the master file. Only 45 spectra, all thiol esters, out of 8782 were found to meet these 29 criteria. No additional compounds were found to meet these criteria. Thus, a rule based on these 29 features was able to separate alkyl thiol esters from any and every other class of compounds in the file. In further experimentation with these criteria, it was found that 13 of the 29 features could be eliminated without finding additional spectra that met the criteria. The features eliminated were:

Peaks present: 45

Peaks absent: 51, 52, 65, 66, 80, 93, 106, 107, 108

Losses absent: 30, 31

Losses present: 89
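The screening and feature-elimination steps described above can be sketched as follows. The text does not say how the 13 redundant features were identified, so the greedy one-at-a-time removal shown here is an assumption, as are the function names and the representation of a spectrum as a set of labels such as `("peak", 45)` or `("loss", 89)`.

```python
def matches(features, must_be_present, must_be_absent):
    """True if a spectrum's feature set contains every required feature
    and none of the forbidden ones."""
    return must_be_present <= features and not (must_be_absent & features)

def select(library, must_be_present, must_be_absent):
    """Names of all library spectra that satisfy the rule."""
    return {name for name, feats in library.items()
            if matches(feats, must_be_present, must_be_absent)}

def prune_rule(library, must_be_present, must_be_absent):
    """Greedily drop criteria whose removal leaves the selected set unchanged
    (an assumed procedure, not necessarily the one used in this work)."""
    present, absent = set(must_be_present), set(must_be_absent)
    target = select(library, present, absent)
    for criterion in sorted(present | absent):
        trial_present = present - {criterion}
        trial_absent = absent - {criterion}
        if select(library, trial_present, trial_absent) == target:
            present, absent = trial_present, trial_absent
    return present, absent
```

Here `library` would map each of the 8782 spectrum names to its set of observed features. A criterion is kept only if dropping it would admit a spectrum outside the originally selected set, so the result depends on the order in which criteria are tried.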



Last, a spectrum of nor-heptyl thiol nor-hexanoate, thought to be a bad spectrum because it did not meet the criteria, was re-run and found to meet the identification criteria derived from the cluster analysis.

Thus, the 16 features given below appear to be able to characterize straight-chain alkyl thiol esters. The criteria are:







ACKNOWLEDGMENT

We are indebted to D. Black for the TOF mass spectra. We wish to thank H. M. Fales, J. R. Slagle, M. Shapiro, and R. C. T. Lee for thoughtful discussions, and R. J. Feldmann for the SAIL program used in the SSP procedure.

Received for review July 23, 1973. Accepted December 26, 1973.



References

(1) S. R. Heller, Chapter 8, "Computer Representation and Manipulation of Chemical Information," W. T. Wipke, S. R. Heller, R. J. Feldmann, and E. Hyde, Ed., John Wiley, New York, N.Y., 1974.

(2) P. C. Jurs, ibid., Chapter 11.

(3) D. S. Smith, ibid., Chapter 12.

(4) J. R. Slagle, C. L. Chang, and S. R. Heller, Anal. Chem., 46, in press.

(5) S. R. Heller, Anal. Chem., 44, 1951 (1972).

(6) W. H. McFadden, R. M. Seifert, and J. Wasserman, Anal. Chem., 37, 560 (1965).