Interpretation of Mass Spectrometry Data Using Cluster
Analysis--Alkyl Thiolesters
Stephen R. Heller1 and Chin L. Chang
Heuristics Laboratory, National Institutes of Health, Bethesda, Md. 20014
Kenneth C. Chu
Computer Science Laboratory, Division of Computer Research and Technology, National
Institutes of Health, Bethesda, Md. 20014
1 Present address: Management Information and Data Systems Division, Environmental Protection
Agency, Washington, D.C. 20460.
The interpretation of mass spectral data has followed three main courses: library searching (1),
pattern recognition (2), and artificial intelligence (3). The goal of these techniques is the
determination of the structure of the molecule from a given mass spectrum. We now wish to introduce
another technique, cluster analysis, as an aid for the interpretation of mass spectral data.
Structure determination can be divided into several subgoals. In this presentation, cluster
analysis is applied to one of them: functional group classification.
The various library searching techniques that determine structures are "non-intelligent" in the sense
that they simply compare an unknown, by a variety of methods, to known
compounds in a library file and indicate possible solutions. The value of such systems depends
on the size of the library. However, as the library grows, so does the cost to store and process the data, as
well as the real and elapsed time to obtain possible answers.
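As a rough illustration of why the cost grows with the library, the sketch below compares an unknown against every entry in a library file. The similarity measure (number of shared peak masses), the function name, and the toy library are assumptions made for illustration only; the search systems reviewed in (1) use a variety of comparison methods that are not described here.

```python
def library_search(unknown, library, top_n=3):
    """Rank library entries by the number of peak masses shared with the unknown.

    unknown: set of m/e values; library: dict mapping name -> set of m/e values.
    Every entry is examined once, so the work grows linearly with library size.
    """
    scored = [(len(unknown & peaks), name) for name, peaks in library.items()]
    # Highest score first; ties broken alphabetically so the result is stable.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for _, name in scored[:top_n]]

# Hypothetical toy library (masses are illustrative, not measured data).
toy_library = {
    "compound A": {15, 43, 61, 90},
    "compound B": {15, 43, 75, 118},
    "compound C": {27, 29, 57},
}
best = library_search({15, 43, 61}, toy_library, top_n=2)
```

The single pass over the library is the point of the example: doubling the number of stored spectra doubles the number of comparisons.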
In an attempt to interpret mass spectra without resorting to a large library, two main techniques have
been employed. The first, pattern recognition, involves using a data base or training set to devise a
way to interpret spectra by "teaching" the computer which peaks and losses are "good" and "bad." The
decision rules obtained from the training set are then used to "predict" the classifications of unknown
spectra.
The second approach, used in the DENDRAL project, is to program empirically derived mass spectral
fragmentation rules. A list of possible structures is constructed and a spectrum is generated for each
candidate; the candidate whose predicted spectrum is most similar to the unknown identifies it. This technique has
been applied successfully to monofunctional acyclic amines and ethers (3).
Both of these methods assume that the classes to be studied are known in advance. In cluster
analysis, the usual approach is to allow the method to classify data into categories or clusters of its
own making. Thus, it is sometimes referred to as "learning without a teacher" or unsupervised
learning. There are also cluster analysis procedures which are "supervised," but these have not
been used here. Cluster analysis has the advantage of possibly finding new ways of understanding
old or puzzling data.
The particular cluster analysis procedure used in this presentation is a graph-theoretical method called
the shortest spanning path (SSP) (4). This procedure creates an ordered list of the sample points which
reflects a minimum-length path through them. Each sample point has 227 components (i.e., it is a
vector in 227-dimensional space). Starting with an ordered list of the sample points, the algorithm
iteratively reorders the list so as to minimize the sum of distances between
adjacent sample points. The resulting sequence of short distances traces a short path through all the
sample points, hence the name SSP. Thus, applying this procedure to the mass spectral data gives a
linear ordering in which similar spectra tend to lie next to one another, i.e., to cluster.
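The idea of iteratively reordering a list to shorten the path can be sketched as follows. The details of the SSP procedure are given in (4) and are not reproduced here; this sketch substitutes a generic segment-reversal (2-opt) heuristic with Euclidean distance, and the function names and the two-dimensional toy points are assumptions chosen for illustration.

```python
import math

def path_length(order, points):
    """Sum of Euclidean distances between adjacent points in the ordering."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))

def shorten_path(points, order=None):
    """Iteratively reorder a list of sample indices to shorten the open path.

    Assumed stand-in for SSP: reverse any sub-list whose reversal lowers the
    total adjacent distance, and repeat until no reversal helps.
    """
    order = list(range(len(points))) if order is None else list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            for j in range(i + 1, len(order)):
                candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if path_length(candidate, points) < path_length(order, points) - 1e-12:
                    order, improved = candidate, True
    return order

# Hypothetical toy data: two well-separated groups, listed in interleaved order.
toy_points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
initial = list(range(len(toy_points)))
final = shorten_path(toy_points)
```

The same functions accept vectors of any length, so a 227-component feature vector could be passed in place of the two-dimensional points. The result is a local optimum of the reordering, not a guaranteed shortest path.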
EXPERIMENTAL
In the particular study undertaken, the data file consisted of 323 mass spectra of compounds
containing only one sulfur atom and any other atoms in any amounts, taken from a master file of 8782
spectra. This subset file was generated using the embedded molecular formula search routine of the
DCRT/CIS Mass Spec Search System (5). The initial file consisted of 525 compounds. However, the
file was reduced to 323 compounds by removing those spectra in the file which did not have peaks
beginning at least at m/e = 26. In addition, duplicates were not removed. The features for each spectrum
consisted of the peaks from m/e = 14 to 140 (127 features) and all losses from the parent ion to M - 99
(100 features), which gave a total of 227 feature points. The choice of features was arbitrary, and it may
be necessary to modify the features used when attempts are made to cluster other compounds. By experimentation,
it was found that replacing the actual intensities with (single) weighted intensities appeared to give
better clusters. The choice of weights was also quite arbitrary and might very well be expected to vary as
other classes of compounds are studied. In this work on the sulfur compounds, peaks and losses in a
spectrum with an intensity of 0.01-49% were given an intensity value of 1, peaks with an
intensity of 50-100% were given an intensity value of 2, and losses with an intensity of 50-100%
were given an intensity value of 4.
The programs for this work were written in FORTRAN and SAIL
(an ALGOL type language), and all were run on a DEC PDP-10 computer. The clustering program for
the 227 features of the 323 sulfur compounds required about 86K words of core and about 30 minutes
of cpu time to run.
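The feature construction described above can be sketched in a few lines. Several points are assumptions rather than statements from the text: the intensity of a loss is taken to be the intensity of the peak at M minus the loss, intensities between 49% and 50% are given the lower weight, absent peaks are coded as 0, and the function names and the example spectrum are hypothetical.

```python
PEAK_MASSES = range(14, 141)  # m/e 14..140 -> 127 peak features
LOSSES = range(0, 100)        # M-0 (parent ion) .. M-99 -> 100 loss features

def weight(intensity, strong):
    """Map a relative intensity (percent of base peak) to a coded weight."""
    if intensity >= 50:
        return strong
    if intensity >= 0.01:
        return 1
    return 0  # peak or loss absent (assumed coding)

def encode(spectrum, parent_mass):
    """Build the 227-component vector from {m/e: intensity %} and the parent mass M."""
    peaks = [weight(spectrum.get(m, 0.0), strong=2) for m in PEAK_MASSES]
    # Assumption: the intensity of a loss k is the intensity of the peak at M - k.
    losses = [weight(spectrum.get(parent_mass - k, 0.0), strong=4) for k in LOSSES]
    return peaks + losses

# Hypothetical spectrum, not measured data.
toy_spectrum = {43: 100.0, 47: 12.0, 90: 30.0}
vector = encode(toy_spectrum, parent_mass=90)
```

In this layout the first 127 positions hold the coded peaks and the last 100 hold the coded losses, so the peak at m/e = 43 sits at index 43 - 14 and the loss of 47 sits at index 127 + 47.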
RESULTS
After the 323 sulfur compounds had been arranged into a linear path by the SSP
procedure, the path was divided into a number of linear segments using the intuitive judgment of
the chemists. (In later work it is hoped to automate this manual step.) Each of these linear segments,
manually defined by a chemist, constitutes a cluster (class). From these segments, part of which is
shown in Figure 1, one class has been tentatively defined as the alkyl thiol esters, and this class, which
has been investigated in depth, will be discussed here. The master file of 8782 spectra contained 45
(actually 46, but one spectrum was found to be incorrect) monofunctional straight-chain alkyl thiol
esters of the general formula:
   O
   ||
R1-C-S-R2
R1 = C1-C10; R2 = C1-C8
Figure 1. Part of reordered list of sulfur compounds after the SSP procedure had been applied
All these spectra were run on a Bendix TOF mass spectrometer (6). There were no thiol esters with
aromatic or saturated rings or other functional groups. Thus, the classification rules are to be
considered applicable only to this limited class of compounds.
The matrix of spectral features, arranged in the linear order found by the SSP program, was
manually inspected, and 29 features were picked out and found to characterize alkyl thiol esters. The
features consist of peaks and losses that were found to be always present or always absent in the class. These
29 features were then tested against all 8782 spectra in the master file. Only 45 spectra,
all thiol esters, were found to meet these 29 criteria. Thus, a rule based on these 29 features was able
to separate alkyl thiol esters from every other class of compounds in the file. In further
experimentation with these criteria, it was found that 13 of the 29 features could be eliminated
without any additional spectra meeting the criteria. The features eliminated were:
Peaks present: 45
Peaks absent: 51, 52, 65, 66, 80, 93, 106, 107, 108
Losses absent: 30, 31
Losses present: 89
Finally, the spectrum of nor-heptyl thiol nor-hexanoate, which had been thought to be a bad spectrum
because it did not meet the criteria, was re-run; the new spectrum met the identification criteria
derived from the cluster analysis.
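The screening and feature-elimination steps described above can be sketched as a presence/absence rule applied to every spectrum in a file, followed by a greedy pass that drops any feature whose removal leaves the set of matching spectra unchanged. The data layout, the function names, and the three-entry toy file are hypothetical; the masses in the toy rule are illustrative and are not the criteria derived in this work.

```python
def meets_rule(spectrum, rule):
    """spectrum: {'peaks': set, 'losses': set}; rule: four sets of integer masses."""
    return (rule["peaks_present"] <= spectrum["peaks"]
            and rule["losses_present"] <= spectrum["losses"]
            and not (rule["peaks_absent"] & spectrum["peaks"])
            and not (rule["losses_absent"] & spectrum["losses"]))

def screen(spectra, rule):
    """Names of all spectra in the file that satisfy every criterion of the rule."""
    return {name for name, s in spectra.items() if meets_rule(s, rule)}

def prune(spectra, rule):
    """Greedily drop features whose removal does not change the matching set."""
    target = screen(spectra, rule)
    pruned = {key: set(values) for key, values in rule.items()}
    for key in pruned:
        for feature in sorted(pruned[key]):
            pruned[key].discard(feature)
            if screen(spectra, pruned) != target:
                pruned[key].add(feature)  # feature was needed; restore it
    return pruned

# Hypothetical toy file and rule (illustrative masses only).
toy_file = {
    "thiol ester X": {"peaks": {43, 61}, "losses": {28}},
    "sulfide Y": {"peaks": {43, 51}, "losses": {28}},
    "sulfone Z": {"peaks": {61, 79}, "losses": {30}},
}
toy_rule = {"peaks_present": {43, 61}, "peaks_absent": {51, 79},
            "losses_present": {28}, "losses_absent": {30}}
matches = screen(toy_file, toy_rule)
reduced = prune(toy_file, toy_rule)
```

Because the pass is greedy, the reduced rule depends on the order in which features are tried; a different order may keep a different, equally selective subset.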
Thus, the 16 features given below appear to be able to characterize straight-chain alkyl thiol esters.
The criteria are:
ACKNOWLEDGMENT
We are indebted to D. Black for the TOF mass spectra. We wish to thank H. M. Fales, J. R. Slagle, M.
Shapiro, and R. C. T. Lee for thoughtful discussions, and R. J. Feldmann for the
SAIL program used in the SSP procedure.
Received for review July 23, 1973. Accepted December 26, 1973.
References
(1) S. R. Heller, Chapter 8, "Computer Representation and Manipulation of Chemical Information," W.
T. Wipke, S. R. Heller, R. J. Feldmann, and E. Hyde, Ed., John Wiley, New York, N.Y., 1974.
(2) P. C. Jurs, ibid., Chapter 11.
(3) D. S. Smith, ibid., Chapter 12.
(4) J. R. Slagle, C. L. Chang, and S. R. Heller, Anal. Chem., 46, in press.
(5) S. R. Heller, Anal. Chem., 44, 1951 (1972).
(6) W. H. McFadden, R. M. Seifert, and J. Wesserman, Anal. Chem., 37, 560 (1965).
MW   MF                  NAME
277  C9.H12.N.O5.P.S     SUMITHION
301  C10.H12.N3.O4.P.S   OXYGEN ANALOG OF GUTHION
286  C14.H10.N2.O3.S     4,6-DIPHENYL-1,2,3,5-OXATHIADIAZINE-2,2 DIOXIDE
273  C11.H12.N.O3.CL.S   CHLORMEZONE (TRANCOPAL)
333  C13.H16.N3.O4.NA.S  METHAMPRYONE (DIPYRONE POWDER ULMER)
295  C14.H18.CL.N3.S     CHLOROTHEN (TAGATHEN)
146  C8.H18.S            2,2,4,4-TETRAMETHYL-3-THIAPENTANE
146  C8.H18.S            TERT-BUTYL SULFIDE
146  C7.H14.O.S          ISO-BUTYL THIOL NOR-PROPANOATE
146  C7.H14.O.S          NOR-BUTYL THIOL NOR-PROPANOATE
146  C7.H14.O.S          ETHYL THIOL NOR-PROPANOATE
146  C7.H14.O.S          ETHYL THIOL ISO-PROPANOATE
118  C5.H10.O.S          ETHYL THIOL ACETATE
118  C5.H10.O.S          METHYL THIOL NOR-BUTYRATE
118  C5.H10.O.S          NOR-PROPYL THIOL ACETATE
118  C5.H10.O.S          ISO-PROPYL THIOL ACETATE
90   C3.H6.O.S           METHYL THIOL ACETATE
60   C.O.S               CARBON OXYSULFIDE
64   O2.S                SULFUR DIOXIDE
34   H2.S                HYDROGEN SULFIDE
48   C.H4.S              METHANETHIOL (METHYL MERCAPTAN)
94   C2.H6.O2.S          DIMETHYLSULFONE
90   C2.H6.N2.S          BIS (METHYLIMINO) SULPHUR
78   C2.H6.O.S           2-MERCAPTO ETHANOL
78   C2.H6.O.S           SULFOXIDE DIMETHYL