Interpretation of Mass Spectrometry Data Using Cluster Analysis--Alkyl Thiolesters

Stephen R. Heller¹ and Chin L. Chang

Heuristics Laboratory, National Institutes of Health, Bethesda, Md. 20014

Kenneth C. Chu

Computer Science Laboratory, Division of Computer Research and Technology, National Institutes of Health, Bethesda, Md. 20014

¹Present address, Management Information and Data Systems Division, Environmental Protection Agency, Washington, D.C. 20460.

The interpretation of mass spectral data has followed three main courses: Library searching (1), Pattern Recognition (2), and Artificial Intelligence (3). These techniques have as their goal the determination of the structure of the molecule from a given mass spectrum. We now wish to introduce the use of another technique, cluster analysis, as an aid for the interpretation of mass spectral data. Structure determination can be divided into several subgoals; in this presentation, cluster analysis is applied to one of them, namely, functional group classification.

The various library searching techniques that determine structures are "non-intelligent" in the sense that they simply perform a task of comparing an unknown, by a variety of methods, to known compounds in a library file and indicating possible solutions. The value of such systems is dependent on the size of the library. Thus, as the library grows, so does the cost to store and process the data, as well as the real and elapsed time to obtain possible answers.

In an attempt to interpret mass spectra without resorting to a large library, two main techniques have been employed. The first, pattern recognition, involves using a data base or training set to devise a way to interpret spectra by "teaching" the computer which peaks and losses are "good" and "bad." The decision rules obtained from the training set are then used to "predict" the classifications of unknown spectra.

The second approach, used in the DENDRAL project, is to program empirically derived mass spec fragmentation rules. A list of possible structures is constructed and the spectra from these possibilities are generated; then the spectrum which is most similar to the unknown identifies it. This technique has been applied successfully to monofunctional acyclic amines and ethers (3).

Both these methods are based on the assumption that the classes to be studied are known. In cluster analysis, the usual approach is to allow the method to classify data into categories or clusters of its own making. Thus, it is sometimes referred to as "learning without a teacher" or unsupervised learning. In addition, there are cluster analysis procedures which are "supervised," but these have not been used here. Cluster analysis has the advantage of possibly finding new methods for understanding old or puzzling data.

The particular cluster analysis procedure used in this presentation is a graph-theoretical method called the shortest spanning path (SSP) (4). This procedure creates an ordered list of the sample points which reflects the minimum path through these sample points. Each sample point has 227 components (i.e., a 227-dimensional vector in space). Starting with an ordered list of the sample points, the algorithm iteratively reorders the list so that the resulting ordered list has the minimum sum of distances between adjacent sample points. This collection of minimum distances represents a short path through all the sample points, hence the name SSP. Thus, applying this procedure to the mass spectral data gives a linear ordering of the spectra which tends to cluster them.
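The reordering idea described above can be illustrated with a minimal sketch. It is not the SAIL program used in this work, whose details are given in reference (4). It assumes a Euclidean distance between feature vectors and uses repeated segment reversal as the reordering step; both choices, and all function names, are assumptions made for illustration. Such a local search yields a short path but does not guarantee the shortest one.

```python
import math
from itertools import combinations

def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def path_length(points, order):
    """Sum of distances between adjacent sample points in the ordered list."""
    return sum(distance(points[order[i]], points[order[i + 1]])
               for i in range(len(order) - 1))

def short_spanning_path(points, order=None):
    """Reorder the list by reversing segments until no reversal shortens the path.

    This is a local search: it stops at an ordering that no single
    segment reversal can improve, which need not be the global minimum.
    """
    order = list(range(len(points))) if order is None else list(order)
    best = path_length(points, order)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            # Reverse the segment between positions i and j (inclusive).
            candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
            length = path_length(points, candidate)
            if length < best - 1e-12:
                order, best = candidate, length
                improved = True
                break  # restart the scan from the improved ordering
    return order, best

# Hypothetical one-dimensional sample points forming two groups.
points = [(0.0,), (10.0,), (1.0,), (11.0,), (2.0,)]
order, length = short_spanning_path(points)
print(order, length)
```

In the resulting ordering, points that lie close together in feature space tend to become neighbors in the list, which is what allows the list to be cut into clusters afterwards.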

EXPERIMENTAL

In the particular study undertaken, the data file consisted of 323 mass spectra of compounds containing only one sulfur atom and any other atoms in any amounts, taken from a master file of 8782 spectra. This subset file was generated using the imbedded molecular formula search routine of the DCRT/CIS Mass Spec Search System (5). The initial file consisted of 525 compounds. However, the file was reduced to 323 compounds by removing those spectra which did not have peaks beginning at m/e = 26 or lower. In addition, duplicates were not removed. The experimental spectra consisted of the peaks from m/e = 14 to 140 and all losses from the parent ion to M - 99, which gave a total of 227 feature points. The choice of the features selected was arbitrary, and it may be necessary to modify the features used when attempts are made to cluster other compounds. By experimentation, it was found that replacing the actual intensities with (single) weighted intensities appeared to give better clusters. The choice of weights was quite arbitrary and might very well be expected to vary as other classes of compounds are studied. In this work on the sulfur compounds, peaks and losses in a spectrum with an intensity of 0.01-49% were given an intensity value of 1. Those peaks with an intensity of 50-100% were given an intensity value of 2. Those losses with an intensity of 50-100% were given an intensity value of 4. The programs for this work were written in FORTRAN and SAIL (an ALGOL-type language), and all were run on a DEC PDP-10 computer. The clustering program for the 227 features of the 323 sulfur compounds required about 86K words of core and about 30 minutes of cpu time to run.
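The feature encoding described above can be sketched as follows. This is only an illustration under stated assumptions, not the FORTRAN or SAIL code used in the work: a loss of k is taken to be the peak at M - k, the loss range is taken to run from M - 0 to M - 99 (which, with the 127 peaks from m/e 14 to 140, gives the 227 features), absent peaks are coded 0, and 50% is used as the boundary between weak and strong. The function names and the example spectrum are hypothetical.

```python
PEAK_MASSES = range(14, 141)  # m/e 14..140 gives 127 peak features
LOSSES = range(0, 100)        # M-0..M-99 gives 100 loss features (assumed range)

def weight(intensity, strong_value):
    """Map a relative intensity in percent to a weighted value.

    Absent (zero) intensity gives 0, anything below 50% gives 1,
    and 50% or more gives strong_value (2 for peaks, 4 for losses).
    """
    if intensity <= 0:
        return 0
    return strong_value if intensity >= 50 else 1

def encode_spectrum(peaks, parent_mass):
    """Build the 227-component feature vector for one spectrum.

    peaks maps m/e to relative intensity in percent.
    """
    peak_part = [weight(peaks.get(m, 0.0), 2) for m in PEAK_MASSES]
    # Assumption: the loss feature k is read from the peak at M - k.
    loss_part = [weight(peaks.get(parent_mass - k, 0.0), 4) for k in LOSSES]
    return peak_part + loss_part

# Hypothetical spectrum with parent ion at m/e 90 (intensities are made up).
example = {43: 100.0, 75: 5.0, 90: 30.0}
vector = encode_spectrum(example, parent_mass=90)
print(len(vector))
```

Weighting strong losses more heavily than strong peaks means that two spectra sharing an intense loss are drawn closer together than two spectra sharing an intense peak.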

RESULTS

After the data, consisting of the 323 sulfur compounds, were formulated into a linear path by the SSP procedure, the data were divided into a number of linear segments by using the intuitive judgment of the chemists. (In later work it is hoped to automate this manual step.) Each of these linear segments, manually defined by a chemist, constitutes a cluster (class). From these segments, part of which is shown in Figure 1, one class has been tentatively defined as the alkyl thiol esters, and this class, which has been investigated in depth, will be discussed here. The master file of 8782 spectra contained 45 (actually 46, but one spectrum was found to be incorrect) monofunctional straight-chain alkyl thiol esters of the general formula:

R1-C(=O)-S-R2

R1 = C₁-C₁₀; R2 = C₁-C₈

MW    MF                    NAME
277   C9.H12.N.O5.P.S       SUMITHION
301   C10.H12.N3.O4.P.S     OXYGEN ANALOG OF GUTHION
286   C14.H10.N2.O3.S       4,6-DIPHENYL-1,2,3,5-OXATHIADIAZINE-2,2 DIOXIDE
273   C11.H12.N.O3.CL.S     CHLORMEZONE (TRANCOPAL)
333   C13.H16.N3.O4.NA.S    METHAMPRYONE (DIPYRONE POWDER ULMER)
295   C14.H18.CL.N3.S       CHLOROTHEN (TAGATHEN)
146   C8.H18.S              2,2,4,4-TETRAMETHYL-3-THIAPENTANE
146   C8.H18.S              TERT-BUTYL SULFIDE
146   C7.H14.O.S            ISO-BUTYL THIOL NOR-PROPANOATE
146   C7.H14.O.S            NOR-BUTYL THIOL NOR-PROPANOATE
146   C7.H14.O.S            ETHYL THIOL NOR-PROPANOATE
146   C7.H14.O.S            ETHYL THIOL ISO-PROPANOATE
118   C5.H10.O.S            ETHYL THIOL ACETATE
118   C5.H10.O.S            METHYL THIOL NOR-BUTYRATE
118   C5.H10.O.S            NOR-PROPYL THIOL ACETATE
118   C5.H10.O.S            ISO-PROPYL THIOL ACETATE
90    C3.H6.O.S             METHYL THIOL ACETATE
60    C.O.S                 CARBON OXYSULFIDE
64    O2.S                  SULFUR DIOXIDE
34    H2.S                  HYDROGEN SULFIDE
48    C.H4.S                METHANETHIOL (METHYL MERCAPTAN)
94    C2.H6.O2.S            DIMETHYLSULFONE
90    C2.H6.N2.S            BIS (METHYLIMINO) SULPHUR
78    C2.H6.O.S             2-MERCAPTO ETHANOL
78    C2.H6.O.S             SULFOXIDE DIMETHYL

Figure 1. Part of reordered list of sulfur compounds after the SSP procedure had been applied

All these spectra were run on a Bendix TOF mass spectrometer (6). There were no thiol esters with aromatic or saturated rings or other functional groups. Thus, the classification rules are to be considered applicable only to this limited class of compounds.

The matrix array of the spectral features, arranged in the linear order found by the SSP program, was manually inspected, and 29 features were picked out and found to characterize alkyl thiol esters. The features consist of peaks and losses that were found to be always present or always absent in the class. These 29 features were then processed against the entire 8782 spectra from the master file. Only 45 spectra, all thiol esters, out of 8782 were found to meet these 29 criteria; no other compounds met them. Thus, a rule based on these 29 features was able to separate alkyl thiol esters from every other class of compounds in the file. In further experimentation with these criteria, it was found that 13 of the 29 features could be eliminated without finding additional spectra that met the criteria. The features eliminated were:

Peaks present: 45

Peaks absent: 51, 52, 65, 66, 80, 93, 106, 107, 108

Losses absent: 30, 31

Losses present: 89
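The screening and feature-elimination steps described above can be sketched as follows. This is a hedged illustration only: the greedy elimination order, the data layout, and all names are assumptions, and the toy library and rule are made-up placeholders rather than the spectra or criteria of this work. The sketch relies on the fact that dropping a criterion can only admit more spectra, so a feature is redundant when the set of matching spectra is unchanged without it.

```python
def matches(spectrum, rule):
    """True if every feature in the rule is present or absent as required.

    spectrum is a dict with sets under "peaks" and "losses";
    rule is a list of (kind, value, must_be_present) tuples.
    """
    return all((value in spectrum[kind]) == present
               for kind, value, present in rule)

def screen(library, rule):
    """Names of all library spectra that satisfy the rule."""
    return {name for name, spec in library.items() if matches(spec, rule)}

def prune_rule(library, rule):
    """Greedily drop features whose removal admits no additional spectra."""
    target = screen(library, rule)
    kept = list(rule)
    for feature in rule:
        trial = [f for f in kept if f != feature]
        if screen(library, trial) == target:
            kept = trial  # feature was redundant for this library
    return kept

# Hypothetical toy library and rule (not the data of this study).
library = {
    "thiol ester 1": {"peaks": {43, 45}, "losses": set()},
    "thiol ester 2": {"peaks": {43, 45, 60}, "losses": set()},
    "other 1": {"peaks": {45, 51}, "losses": {30}},
    "other 2": {"peaks": {51}, "losses": set()},
    "other 3": {"peaks": {43}, "losses": set()},
}
rule = [("peaks", 43, True), ("peaks", 45, True),
        ("peaks", 51, False), ("losses", 30, False)]

reduced = prune_rule(library, rule)
print(reduced, screen(library, reduced))
```

Because the elimination is greedy, the reduced rule can depend on the order in which features are tried; a different order may give a different, equally selective subset.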

Last, a spectrum of nor-heptyl thiol nor-hexanoate, thought to be a bad spectrum because it did not meet the criteria, was re-run and found to meet the identification criteria derived from the cluster analysis.

Thus, the 16 features given below appear to be able to characterize straight-chain alkyl thiol esters. The criteria are:

ACKNOWLEDGMENT

We are indebted to D. Black for the TOF mass spectra. We wish to thank H. M. Fales, J. R. Slagle, M. Shapiro, and R. C. T. Lee for thoughtful discussions, and also wish to thank R. J. Feldmann for the SAIL program used in the SSP procedure.

Received for review July 23, 1973. Accepted December 26, 1973.

References

(1) S. R. Heller, Chapter 8, "Computer Representation and Manipulation of Chemical Information," W. T. Wipke, S. R. Heller, R. J. Feldmann, and E. Hyde, Ed., John Wiley, New York, N.Y., 1974.

(2) P. C. Jurs, ibid., Chapter 11.

(3) D. S. Smith, ibid., Chapter 12.

(4) J. R. Slagle, C. L. Chang, and S. R. Heller, Anal. Chem., 46, in press.

(5) S. R. Heller, Anal. Chem., 44, 1951 (1972).

(6) W. H. McFadden, R. M. Seifert, and J. Wasserman, Anal. Chem., 37, 560 (1965).
