Computer Techniques for Interpreting Mass Spectrometry Data
Stephen R. Heller
Division of Computer Research and Technology
National Institute of Health
Bethesda, Maryland 20014 USA
In a previous chapter, Feldmann (1) has discussed and described the structure search
component of the Chemical Information System (CIS) being developed at the National Institute
of Health (NIH) in the Division of Computer Research and Technology (DCRT). This chapter
will concern itself with the details of another component of CIS, the mass spectrometry search
(MSS) system. The philosophy behind this work is similar in many respects to Feldmann's
structure search. The techniques, methods, and facilities used in the CIS attempt to show the
current use and future potential of computers in chemical information handling and processing.
As the CIS develops it is hoped that it will provide a useful function to scientists interested in a
particular aspect of chemistry and will generate ideas for the solution of more complex problems.
There has been a great deal of research devoted to the computer handling of mass spectral
data for the purpose of structure identification. Mass spectra, which can be obtained on as little
as 10 grams of material and in quantities of hundreds of spectra hour from a typical gas
chromatography-mass spectrometry (GC/MS) computer system, have provided a very large and
fertile field for exploring methods for the computer handling of data. Mass spectra can be
obtained from a number of different types of instruments. While some variation in intensity may
be expected for this reason, the general features of the spectra are independent of the
spectrometer. The spectra that have found their way into the computer readable files that will be
discussed in the chapters at this Advanced Study Institute (ASI) are virtually all electron
ionization (ET) spectra. The spectra are all "low resolution," that is, they are measured to unit
mass value rather than to the exact mass of the atom combination. Measuring the exact mass,
often erroneously called "high resolution" mass spectrometry because it is usually carried out at
high resolution, can distinguish between molecules having N2, CO, and C2H4 groupings whose
exact mass increments are 28.0061, 27.994 and 28.031 respectively. A typical EI mass spectrum
is shown in Figure 1. The spectrum consists of the intensities of the various ions (isotopes,
fragments, rearrangement, molecular ions, etc.) against ion m/e values. The data are usually
represented as an x-y bar plot, since the peaks are constant in width.
Figure 1. A Typical EI Mass Spectrum
Four approaches to the interpretation of mass spectra will be presented at his ASI:
1. Library or File Searching
2. Cluster Analysis
3. Pattern Recognition/Learning Machine
4. Artificial Intelligence
The application of all these techniques has been described previously, except for cluster
analysis. This chapter will be devoted to discussing the first two techniques, and the following
chapters by Jurs (2) and Smith (3) will discuss the latter two methods.
The problem of computer file searching of spectral data has been recognized for some
time, in infared (IR) (4) and in mass spectrometry (5). The first attempts to handle mass spectra
data by modern computers were done in Sweden by Abrahamsson and Stenhagen in the mid
1960's (6). They developed search techniques based on the five strongest or largest peaks in a
spectrum and they further had a provision for using the molecular weight as a key for restricting
the number of possible answers from the search. They also developed the first of the so-called
"matching" or "similarity" indexes that would rate or rank the similarity of possible answers
relative to the unknown. Later workers, pointing out that the strongest or largest peaks in a
spectrum were not necessarily the most important for identification, developed a variety of other
measures, such as increasing the number of strongest peaks (7). Finally Biemann and coworkers
(8) developed the "abbreviated spectrum." This approach selects the n (where n=2, although
others use 1 or 3) largest peaks in every 14 atomic mass unit (amu) interval starting at mass 6
since this retains most of the structurally significant peaks in a spectrum because the basic
organic building block unit is CH2, i.e., a methylene group. An additional reason they chose this
approach was to reduce the storage room required for the spectra. In the case in which n=2, file
is reduced to about 24% of its original size.
With the typical double-density disks now available (and laser memories expected
shortly), this need for storage limitations becomes much less important. On the other hand their
approach did illustrate that an "abbreviated spectrum" could be used for identification. More
recently, McLafferty and coworkers (9) have extended the notion of preprocessing mass spectra
for library searching using a variety of techniques that the chemist/mass spectroscopist brings to
bear upon the problem. In addition to using a simple abbreviated file as Biemann, their Self
Training Interpretative and Retrieval (STIRS) system also uses ion series, low and high mass
characteristic ions (most abundant odd and even mass ions), small and large neutral losses and
secondary neutral losses. One significant problem with this system is that it is a sequential
search system and thus inherently slow (in real time) and non-interactive. The fixation for
storage compression has led to some very interesting studies by Grotch (10) and Isenhour (11)
regarding the intensities of the peaks in a mass spectrum. In a series of papers Grotch describes
the compression of intensity information into a few as two bits. Grotch also has worked on
compression codes for peaks; such compression undoubtedly does speed up search times.
DCR/CIS MASS SPECTRAL SEARCH (MSS)
The results of these studies, as well as some others, have been well reviewed by Ridley
(12). Since these systems have been developed on local computers for local investigators, the
main factor that they have locked has been exposure to a wide base of users, such as organic
chemist and analytical chemists, in addition to the other mass spectroscopists. The non-accessibility or unavailability of these systems has been the main point to which the NIH system
has addressed itself. The various techniques in the literature were assessed, and then we
attempted to make the most obviously useful features of existing systems directly available to all
scientists with provisions for altering or modifying searches and using other general chemical
information that might be at their disposal. In other words, a flexible, interactive conversational
mass spectral search and retrieval system was constructed. Table 1 list the options provided in
the system. As expected, the peak and intensity searches used predominately. However, if one
knows the molecular weights or partial/complete formulae, these may also be used as keys for
more efficient searching although useful homologues will be missed. With all these search
options the scientist is able to test ideas quickly and to receive virtually immediate responses,
which hopefully prod him in to the answers. As opposed to most of the previous systems which
arrive at a set of possible answers and stop, the NIH system has provided a variety of options,
including a spectrum printout directly from on-line disk storage of the data or microfiche and/or
graphics display of spectra for the very important (psychologically and experimentally) ultimate
comparison of unknown and possible solutions.
1. PEAK AND INTENSITY SEARCH
2. MOLECULAR WEIGHT SEARCH
3. MOLECULAR FORMULA SEARCH
4. MOLECULAR WEIGHT AND PEAK SEARCH
5. MOLECULAR FORMULA AND PEAK SEARCH
6. MOLECULAR WEIGHT AND MOLECULAR FORMULA SEARCH
7. DISSIMILARITY COMPARISON
8. SPECTRUM PRINTOUT
9. MICROFICHE DISPLAY OF SPECTRUM
10. DISPLAY OF SPECTRUM
11. PLOTTING OF SPECTRUM
12. CRAB - COMMENTS and COMPLAINTS
13. HARVEST - ENTERING NEW DATA
14. NEWS - NEWS OF THE SYSTEM
15. MSDC CLASSIFICATION CODE LIST
Table 1. Mass Spectral Search System Options
Before describing the file structure, the various search options will be outlined. Further
details are available elsewhere (13). The backbone of the search system is the peak and intensity
search option, which accounts, at present, for over 95% of the searches performed with the
system. The file that is being searched in this option is an abbreviated file and consists of the
two largest peaks in every 14 m/e intervals, starting at m/e of 6. Quite often this is forgotten by
the user, and, for example, peaks at 29, 30 and 31 are entered. The result, of course, is that no
spectra in the file are matched, since there are no spectra in the abbreviated file being searched
with 3 peaks in that 14 amu interval. On the other hand, m/e 33, 34 and 35 would be quite
acceptable. Using the three largest peaks in the 14 amu interval would increase the file size
somewhat, but might help those that have a tendency to make the above error. Owing to
variations in instruments (e.g. magnetic, electric, quadrupole), it is necessary to have an intensity
range factor in the peak search. By allowing the factor to be varied by the user, he can "get a
feeling" for the file and how intensities do or do not affect a search for his particular class of
compounds. This is especially necessary because of out group of users. It has been found that
very effective searching can be done with a very few (two to four) well-chosen peaks, almost
independent of the intensity range decided upon. Figure 2 is a typical peak and intensity search.
The opportunity for a chemist to interact with the system (and see how a search is proceeding) is,
we feel, fundamental in the success of this system, Feldmann's structure search and the synthesis
programs of Corey (14) and Wipke (15).
The search option used next most often is the molecular weight (MW) search which
simply provides a list of names, molecular formulae, and identification numbers (ID#'s used for
other options within the system) for those compounds in the file with the specified MW. This
option is particularly useful in Chemical Ionization (CI), Field Ionization (FI) and Field
Desorption (FD) spectrometry where the molecular weight is easily deduced.
Figure 2. Example of a Peak Search
Figure 3 shows an example of a MW search for a molecular weight of 103. Another search
option used almost as often as the MW search is the molecular formula (MF) search option,
consisting of two parts. The complete MF part allows one to search for a complete molecular
formula and provides the same information output as the MW search. Figure 4 shows a typical
complete MF search. The imbedded MF search allows one to search for spectra in the file using
only part of the molecular formula. This option uses the individual atom type and its magnitude
in the molecule. Thus, one can search for Br2 or C6, but not (as yet) for any Br compounds up to
and including Br2 or any number of carbon atoms up to or including C6. Figure 5 shows a
search for the same formula as shown in Figure 4, except by allowing the formula to be part of a
larger molecule, more answers are found. These three searches (peak/intensity, MW and MF)
have also been combined to allow for a molecular formula and molecular weight search, a peak
and molecular formula search and a peak and molecular weight search. These combined
searches are extremely effective in taking a minimal amount of information and reducing the
possibilities in the file to usually less than a half dozen with just a MW or MF and a peak.
TYPE MOLECULAR WEIGHT
CR TO EXIT
PROGRAM: FOUND 8 SPECTRA WITH MW: 103
PROGRAM: PRODUCE REFERENCES YES OR NO?
Figure 3. Example of a Molecular Weight Search
Figure 4. Example of a Complete Molecular Formula Search
Figure 5. Example of an Imbedded Molecular Formula Search
Figures 6 and 7 are examples of two of these combined search options. In the future, when the
data is available, it is planned to add to the system Feldmann's nested structure searching, the
Mass Spectrometry Data Centre's Classification Code scheme, and the references to the spectra
(e.g. type of instrument, conditions, literature references).
Figure 6. Example of a Peak and Formula Search
Figure 7. Example of a Peak and Molecular Weight Search
Once the user has a possible answer, he has the option of quantitatively comparing it with
his unknown using the dissimilarity index (DI) option. This option takes the intensity of a file
compound and the unknown at each m/e value from 12-500 (or any smaller user-specified range)
and makes a comparison derived from a Euclidean geometry distance measurement. The DI in
the NIH system also allows for comparison of two files spectra, which is very useful in checking
for duplicates in the file. While this is an objective measure for comparison, mass spectrometry
structure identification is still, we feel, basically subjective. Thus, we have provided a number of
options that allow the user to judge for himself the quality of his answer. The first and simplest
is the spectrum printout option, which simply types out all (or just a portion, if so specified) of
the peaks and intensities in a spectrum.
Figure 8. Example of a Spectrum Printout
Figure 8 shows a typical spectrum printout. Alternatively, with a graphics terminal, such as the
GT-40 or Tektronix 4010 being demonstrated at this ASI, the display option plots the data in a
standard x-y bar plot. Lastly, the plot option allows the same data that has been on the display to
be plotted on either a Caclomp or Zeta Plotter. The mass spectrum plot in Figure 1 was obtained
using this option. The data base has also processed by a COM (Computer Output Microfiche)
devise and it is available on microfiche, either using a manual or an automatic (and computer
drive) viewer. Figure 9 shows the computer driven microfiche viewer next to the terminal user
to drive the viewer. The image on the screen is one of 192 spectra on each fiche. The carousel in
which the fiche reside inside the viewer requires a four second maximum access time for
viewing. In addition to these options, there are miscellaneous options for entering data,
commenting about the system and receiving news of the system. Lastly, a fragment loss search
option is currently being added, which will allow for the searching of neutral loss from-1 (i.e. the
parent ion (P) +1 peak) to P-100. Preliminary work in using this feature as an additional tool for
structure elucidation has been very promising.
Figure 9. Computer Driven Microfiche Viewer
Now that the system has been outlined, some description of the file structure is in order
so that one may understand how and why the system operates. The specific examples are for a
DEC PDP-10 computer, but the principles are applicable to any modern timesharing computer
system. In fact, the system has recently been transferred to the GE-635 computer with a minor
number of problems. Previous workers have considered that the search and retrieval system
involved two problem areas--the representation of the data and the search algorithm. Actually,
there is only one problem; the data representation and search algorithm are virtually
indistinguishable and inseparable--if they are to be good. The file structure used in the mass
spectral search is similar to Feldmann's file structure in that the method of searching is dictated
by the file structure and is the reason for the interactive capability.
All the files in the mass spectral search system can be viewed as one of two types--one
containing numeric search data (peaks, molecular weights, etc.) and one containing character
search data (molecular formulae). If the data is binary or ASCII, packed or unpacked, these
formats are just minor modifications of the basic file structure.
The abbreviated peak file consists of the identification number (ID#), peak, intensity, and
molecular weight (MW), sorted by peak and ID# respectively. The file generation program takes
this file and creates a pointer file which indicates whether or not there are any peaks at that mass
and if so, where in the reference file the ID#, intensity, and MW are stored. The ID#, intensity
and MW are all packed into one 36-bit word, 18 bits for the ID# (allowing for 2 or 262,143
spectra in the file), 7 bits for the intensity (allowing for the usual 0-100% intensity range), and 11
bits for the molecular weight (allowing for a MW up to 2048). When a search is performed, the
value of peak is used to go to that word or cell in one of the 128-word blocks of the pointer disk
file and if a non -1 number is found (-1 indicating no reference for that molecular weight), that
number is the starting point for the ID#, intensity, and MW in the reference file. (The 128-word
size for the block is for the PDP-10 computer disk block size. The GE-635 computer uses a 315
work block and so the file on the GE system contains the same information, but located in
different blocks). The program then simply looks up this starting point (subtracting it from the
starting point of the next mass will give the number of references for this mass), unpacks the
intensity, and checks to see if it is in the allowable range. If it is, then the ID# is written out onto
a scratch disk file to be used later for further search comparisons or to look up the name
associated with the ID#. The molecular weight could readily be made a search parameter by
allowing for a MW range and unpacking here and checking it in addition to the intensity. Now,
when there is a second, third (and so on) search, the same basic look-up is done and a Boolean
AND logic operation is performed between two lists of ID#'s. Figure 3 shows a typical MW of
103. Tables 2 and 3 are the pointer and reference files respectively for this MW search. These
files were used since they are ASCII, rather than binary (which is the case for the peak reference,
molecular formula pointer, and name reference files). The MW of 103 is used as a pointer to the
103rd word or cell in the 128-word block, which contains the pointer number 1563. Subtracting
this form the 104th word -1, gives 1572-1563-1=8, which is the answer given in Figure 3. One
then goes to block 1563/128 + 1=13 (with a remainder of 27) in the reference file and the 27th
word/cell is, indeed, 809. The module function was used to calculate this value. The following
seven numbers are the ID numbers shown in Figure 3. These ID# are then used in the identical
name disk file look-up to go to the name pointer and reference files. One first uses the ID# to
locate the starting point and length (by subtraction from the starting point of the next sequential
ID#) of the name. The name, stored as five 7-bit binary characters per 36-bit word, is then
printed out. The spectrum printout option works in the same manner, but in this case, the peak
and intensity are stored in binary form, 18 bits each in one 36-bit word.
In the case of molecular formulae (MF), a different approach is necessary. Molecular
formulae are character strings and do not have any particular numerical value associated with
them. Thus, to be handled in the computer, it is necessary to convert the molecular formula
string into a number or location in the disk pointer file. One method to use in "converting" a
molecular formula to numeric key pointer value is called hash coding (16). In this procedure one
processes the molecular formula character string through a function which produces a number.
Clearly the function must be reasonably well chosen so that two different strings will rarely
produce the same number and when such a "collision" occurs it can be handled with a minimum
of difficulty. Now that one has converted the molecular formula string into a numeric key-value,
this number is used to address the appropriate pointer file block. In this case, the block is
divided into two-word cells as opposed to one-word cells for the previous file structure. This is
necessary because one cannot use the hashed coded numeric key valued as both and indicator to
the pointer file and an indicator to the reference file. The second word in each cell of the pointer
file is used as the pointer to the starting position of the ID#'s in the molecular formula reference
files. Tables 4 and 5 are examples of the molecular formula pointer and reference files. Tables 4
and 5 are examples of the molecular formula pointer and reference files. If one takes the first
molecular formula in the file, Ar (for the inert gas Argon) and processes it through the hash
function, the result is 4682565. The number, 4682565 is found in block 14 of the pointer file,
cell #7 (word #13). In the second part of that cell is the starting address position of the
references and the number of references to Argon are contained in the first word of the reference
file. Indeed, the second word is 1, which is where the first reference would be expected to be and
Table 4 shows that the first word of Block 1 is 5, the number of Argon references plus 1. The
next four words are ID#'s for the Argon spectra. The reason for the number 5, instead of 4 is to
take into account the space (i.e. one extra word) for the number of references. This, then is how
the complete and partial molecular formula searches are performed.
In summary, the file structure attempts to maximize the use of the direct access disk and
the 36-bit word length of the PDP-10. If one had to use a computer with a different word size,
the system would, of course, still work, but would require some minor changes in the sizes of the
cells to pack the peak, intensity and molecular weight.
Table 2. Molecular Weight Pointer File
Table 3. Molecular Weight Reference File
This question of using smaller machines (that is, the so-called "mini-computer") is worth
looking at from the standpoints of ease of use, value, or utility and its probable role in the future.
At present, mini's are used for lab automation and often to perform numerous other tasks, but the
direction in which technology is heading is towards mini-processors, which are cheaper and
better to use. While search systems utilizing mini-computers have been well designed and might
be viewed at first glace as a very effective alternative to the central search system described, here
a few points are worth noting. First, the great variety of mini-computers with incompatible
software makes use of available programs difficult. Second, assuming one did have such a mini
computer available in his laboratory, it could be used at one time either for data acquisition or
data analysis - not for both. As more instrumentation is computerized, there will probably be
less time on a mini in the lab available for analysis/retrieval.
Table 4. Molecular Formula Pointer File
Table 5. Molecular Formula Reference File
In addition to the problem of getting time on the lab computer, there is the initial expense of the
data base, the time/cost for processing it for the search program. Then there is the time/cost for
correcting errors in the data and then reprocessing the data base. Along with this, there is the
time/cost of obtaining new data and processing it for the file.
Let us now assume that the potential routine users of such a system have been satisfied that a central system is the right direction in which to go, considering the time, costs, and lack of manpower expertise to perform the task. What of the researcher, the scientist who would like to experiment with the data base, either for alternative retrial methods or other types of data analysis (e.g., cluster analysis or pattern recognition)? At no cost to him, he has a very large data base available on-line and one that someone else will maintain, correct and augment. With a versatile search system or a program of his own, he may extract those parts of the data base of interest to his research and work with only that portion. Having a large and powerful computer system at his disposal, it is much easier to write in a higher level programming language and easier to perform experiments. Thus, the central search facility approach appears to offer a valuable service to the researcher also.
LIBRARY SEARCH SYSTEM SUMMARY
As Grotch (10), Isenhour (11), and others have stated, the most desirable attributes of a
search system are generally thought to be:
1. Ease of use
4. Low cost
The MSS system fits these four criteria very well. Its ease of use and availability are
evident from its use (25 sessions per day) and the size of the user community (over 200 users).
Its effectiveness or efficiency is good and is essentially the same as most other systems'. Lastly,
its overall low cost is also quite obvious. These attributes led to the system's being transferred to
the GE network. A demonstration of the search system described here will be available to ASI
participants from the GE node on the network here in The Hague, The Netherlands. The
management of and responsibility for the system is now being handle by the Mass Epec Data
Centre (MSDC) in Aldermaston, England. In this way the value and use of the system to the
world-wide mass spectrometry community can be tested and evaluated and the data base made
available to research workers in the field.
Undoubtedly these concepts will take time to develop and evolve. Time-sharing systems
are still in their infancy (17). At present, there is only one large international network (the
GE/Honeywell network), and it goes only from Tokyo to Rome, leaving a large segment of the
world outside the system. A few other companies are just starting to provide international
networks, while most are regional (i.e., -serve a dozen large states and a few dozen large cities in
the U.S. or perhaps a few cities in England). One major goal of this presentation is to show that
research efforts devoted to developing systems (programs, data bases, etc.) that can readily and
easily be made available over telephone lines to computer networks are of value to the scientific
community. Using such a facility will enable more research time and effort to be devoted to the
actual research rather than the presentations, program rewriting, data base reformatting, etc., that
are so prevalent today.
An example of how this large file of mass spectral data being stored continuously on-line
can be used for research will now be illustrated using a technique called cluster analysis which is
usually considered a part of pattern recognition. Cluster analysis is concerned with finding
homogeneous sets of data points within a larger set of data. Within cluster analysis there are a
number of approaches. Only one will be used here. Further details are available in a number of
books and articles (18). Cluster analysis is a method that classifies, associates or clusters a set of
data into a number of natural sets containing similar data. It differs, generally speaking, from the
pattern recognition techniques to be described later by Jurs, in that cluster analysis is usually
non-supervised learning and not limited (i.e., the number of clusters are not known). Pattern
recognition methods require pre-classification or pre-labeling of the data and a predetermined
number of classes into which the data can fall. However, cluster analysis allows the data to fall
into whatever classes the particular clustering techniques generate; i.e., the chips fall where they
may. It should be pointed out that there are, however, situations in which the cluster analysis is
unsupervised but limited. By this it is meant that while the data points are un-labeled, the
number of classes into which they may fall is limited. This area will not be dealt with here, but
rather an example of using the unlimited and non-supervised mode will be presented since it has
a great dealt of potential value in chemistry in that it may enable the scientist to gain insight into
the nature of the data he is analyzing. Such approaches have found value in analyzing data for
taxonomy classification, economics, geography, and automatic generation of the thesauri (19).
Clustering opens up the possibility of discovering new facts about old data or puzzling data.
There are a number of clustering techniques available. Kowalski and Bender have
discussed the Q-mode clustering method and non-linear mapping (NLM) as two techniques for
separating data where there are only a few clusters (20). This latter technique reduces the n-dimensional data to 2 dimensions so that it may be visualized as x, y points and was used to
perform the recognition of the classes defined by the Q-mode method. The NLM method
necessarily introduces some distortion into the final picture/display in two dimensions, but the
guiding principal behind the mapping dictates that the close distances are preserved, and this in
turn preserves the clustering of the data. NLM suffers from the disadvantage that the mapping
produced can often be too dense for a good display. Another type of cluster analysis procedure is
the use of graph theory methods and Zahn (12), in a comprehensive review of this subject,
describes a number of algorithms for clustering based on the so-called "minimal spanning tree"
Data point sets come in a variety of patterns. Figure 10-12 show three sample problem
data sets that cluster analysis has been used to solve. Figure 10 is the trivial case of two distinct
clusters which can be readily separated. However, if one looks closely, on of the two sets of ten
points can probably be separated into two smaller sets of five points. Figure 11 shows a more
complicated case of a set of points attached via a short "neck" and is more difficult to separate.
Lastly, Figure 12 shows a possible mass spectral data problem. In Figure 12 one can readily find
anywhere from five to eleven (or more) clusters, depending on what criteria one uses for
"similarity." Ball (18) has discussed a number of measures of similarity. They include the dot
product, similarity radio, weighted and unweighted Euclidean distances and weighted and
unweighted Boolean "AND" normalized correlation methods.
Figure 10. Two Separate Clusters
Figure 11. Two Clusters Joined at a Neck
Figure 12. Multi-Cluster Pattern
Figure 13 illustrates one of the graph-theory methods, the shortest spanning path (SSP)
cluster procedure (22). A simple distance measurement to find the two clusters in the data set
(steps A-D). In part A, one has the starting point set. One then connects all pairs of points as
shown in part B. In step C one obtains the SSP. In part D, the "inconsistant" edge J-K has been
deleted since it has a larger distance weight than any other neighboring edges and hence the data
points are clustered. If one now considers the total weight of edges in a graph as the sum of all
the constituent edges, the minimum or shortest spanning path (SSP) in this graph is a spanning
path whose weight is a minimum among all the spanning paths of this graph.
Figure 13. SSP (Shortest Spanning Path) Cluster Procedure
The SSP method can also be used in data reorganization, that is, the data can be
rearranged according to the "shortest spanning path" for the data. We have used the SSP method
to rearrange parts of the mass spectrometry data base (23). It has been used because it is
available, is able to handle large amounts of data, and allows for a convenient presentation of the
clusteral data. No comparisons with other clustering techniques have been made using this mass
spectral data, so that the relative efficiency of the SSP method cannot be indicated. Since finding
an absolute minimum path with a large number of data points is time consuming, the SSP
method uses an approximation technique which finds a relative minimum path. This procedure
is a far less time consuming process and is able to handle large amounts of data. One very
serious limitation to most previous clustering methods has been their inability to handle realistic
amounts of real experimental data. While the SSP does have a favorable side to it, one obvious
potential problem with the SSP method is shown in Figure 14, which is very similar to the data
point samples in Figure 11. Anyone looking at this data will readily conclude there are two
clusters, but the closeness of the sum of the point near the neck of the two clusters and the SSP
technique of finding the closest neighbors from any given pairs of points would lead to the path
shown in the diagram. If one labels one cluster A and the other B, then the resulting reorganized
data will not be A and B, but rather BAB. Thus, one must be cautious in using the results
directly and automatically.
Figure 14. The SSP Through a Set of Clusters
The compounds used in the sample clustering experiment were molecules containing one
sulfur atom and any other element (in any combination and/or amounts) taken from the MSS data
base. Using the imbedded molecular formula search for compounds with one sulfur atom, 525
answers were found. Because some of spectra do not have peaks starting at mass 26, these were
discarded and 323 remained. Duplicates were not removed. The data was first used as is,
without any weighting of the intensities. The peaks from m/e=14 to 140 and losses from 0 to -99
(i.e. Parent to Parent -99 amu) were considered as features for a total of 227 features. The losses
were calculated from the known molecular weights of the compounds. The data was later
weighted since it was found to produce better clustering. The weighting scale assigned a value
of 1 to all peaks and losses with an intensity less than 10%, 2 to all peaks greater than 10%, and 4
to all losses greater than 10%. The matrix output consists of the sorted list of 323 rows of sulfur
compounds (sorted according to the order of the SSP) and the 227 columns of features used to
classify them. The computer program used has the ability to "sort" the 227 features as well as
the compounds, but this feature analysis was discarded owing to the computer time necessary to
perform this clustering and the relative lack of information that it seems to impart at this early
stage of analysis. Without the sort feature, the SSP program required 86K words and about 30
minutes of cpu time (on a PDP-10) to cluster the 323 sulfur compounds. Work is underway to
reduce both these factors.
We have analyzed the resulting matrix output, defining clusters partly based on the results
of the SSP sorted list and partly on chemical intuition. This was done because the potential
problem in Figure 14 appeared to occur in this case. Table 6 contains part of the sorted list of a
sample run showing part of the sulfur cluster. It is fairly clear that some groups exist, such as:
sulfur-halogen, sulfur-phosphorus, small compound, C2 sulfur compounds, etc. Table 7 contains
a partial list of the clusters that have been tentatively identified from this method. These
identifications were made with relatively little difficulty because of the matrix output
representation that the SSP method produces. This output, which indicates which features
(peaks, losses) are used for the clustering, distinguishes the SSP method from most other
clustering techniques and is a value asset in the analysis. The true test of this technique will
come as more data is made available and tested based on the results found from this small class
of sulfur data.
Table 6. Part of the Reorganized List of Sulfur Compounds
2. THIAZOLES, BENZOTHIAZOLES
3. AROMATIC SULFUR
6. SULFITES AND SULFATES
7. BICYCLIC SULFUR
8. THIOL ESTERS
9. ISOTHIOCYANATES, THIOCYANATES
10. C6-C8 THIOLS, SULFIDES
11. C5-C6 CYCLIC SATURATED THIOLS, SULFIDES
Table 7. Partial List of Clustered Groups
However, even now, the peaks and losses that the SSP clustering technique has shown to be
characteristic for a group such as the monofunctional straight-chain alkyl thiol esters (RCOSR)
have shown that this approach offers a great deal of potential for interpretating mass spectral
data. This approach of recognition and classification of (functional) groups may lead to new
empirical rules for structural elucidation.
I wish to express my deep appreciation to Dr. Henry M. Fales for his kindness and help
with this work and his support of the search system. I also wish to thank Dr. Arnold W. Pratt,
Director, Division of Computer Research and Technology for his foresight, continued support
and interest in the computer applications work presented here.
1. R. J. Feldmann, Chapter 3, "Proceedings of the NATO/CNAASI on computer
Representation and Manipulation of Chemical Information" edited by W. T. Wipke, S. R.
Heller, R. J. Feldmann, and E. Hyde, John Wiley, New York (1973).
2. P. C. Jurs, Ibid, Chapter 11
3. D. Smith, Ibid, Chapter 12
4. (a) D. H. Anderson and G. L. Covert, Anal. Chem., 39, 39, 1288 (1967).
(b) D. S. Erley, Ibid, 40, 894 (1968).
(c,d) F. E. Lytle, Ibid, 42, 355, 1532 (1970).
(e) P. C. Jurs, Ibid, 43, 364, (1971).
(f) D. S. Erley, Appl.. Spec., 25, 200 (1971).
5. P. D. Zemany, Anal. Chem., 22, 950 (1950).
6. (a) S. Abrahamsson, S. Stenhagen and E. Stenhagen, Biochem. J., 92, (1964).
(b) S. Abrahamsson, Science Tools, 14, 129 (1967).
7. (a,b) V. L. . Talrose, V. V. Baznikow and G. D. Tantsyrev, Dokl, Akad. Nauk SSR, 159, 182 (1964); 170, 379, (1996).
( ) Petterson and R. Rayhage, Arkly. Kemi, 26, 293 (1967).
(d) B. A. Knock, I. C. Smith, D. E. Wright and R. G. Ridley, Ibid, 42, 1516 (1970).
(e) L. R. Crowford and J. D. Morrison, Anal. Chem., 40, 1464 (1968).
8. H. S. Herz, R. A. Hites, and K. Biemann, Anal. Chem., 43, 681 (1971).
9. (a) F. W. McLafferty, Pure Appl.. Chem., 7, 61 (1971).
(b) K. Kwok, R. Venkataraghavan and F. W. McLafferty, J. Amer. Chem. Soc., 95,
10. (a) S. L. Grotch, Anal. Chem., 42, (1970).
(b) S. L. Grotch, Ibid, 43, 1362 (1971).
© S. L. Grotch, Ibid, 45, 2 (1973).
11. L. E. Wangen, W. S. Woodward, and T. L. Isenhour, Ibid, 43, 1605 (1971).
12. R. G. Ridley, Chapter 6, "Compound Identification by Computer Matching Mass Spectrometry," edited by G. R. Waller, John Wiley, (1971), and references therein.
13. (a) S. R. Heller, Anal. Chem., 44, 1951 (1972).
(b) S. R. Heller, H. M. Fales, and G. W. A.Milne, J. Chem. ed., 49, 725 (1972).
© S. R. Heller, H. M. Fales, and G. W. A. Milne, Org.. Mass Spec., 7, 107 (1973).
(d) S. R. Heller, D. A. Koniver, H. M. Fales, and G. W. A. Milne, Anal. Chem., in press.
(e) S. R. Heller, R. J. Feldmann, H. M. Fales, and G. W. A. Milne, J. Chem. Doc.,
14. (a) E. J. Corey and W. T. Wipke, Science, 166, 178 (1969)
(b) E. J. Corey, Quarterly Rev. 25, 455 (1971).
(c,d) E. J. Corey, W. T. Wipke, R. D. Cramer, and W. J. Howe, J. Amer. Chem, Soc., 94,
421, 431 (1972).
(e) E. J. Corey, R. D. Cramer and W. J. Howe, Ibid, 94, 440 (1972).
(f) E. J. Corey and G. A. Petterson, Ibid, 94, 460 (1972).
15. W. T. Wipke, Chapter 7, "Proceedings of the NATO/CNA ASI on computer Representation and Manipulation of Chemical Information" edited by W. T. Wipke, S. R. Heller, R. J. Feldmann, and E. Hyde, John Wiley, New York (1973).
16. R. Morris, Comm.. ACM., 11, 34 (1968).
17. Barrons, "Time Sharing Marches On," Barrons, January 15, 1973, p. 3.
18. (a) R. C. Tryson, "Cluster Analysis, " Edwards Bros. Inc. Ann Arbor, Michigan (1939).
(b) R. R. Sokal and P. H. A. Sheath, "Principles of Numerical Taxonomy, " Freeman, San Francisco (1963).
© N. J. Nilsson, "Learning Machines, " McGraw-Hill, New York (1965).
(d) G. H. Ball, Proc. AFIPS 1965 Fall Joint Comp.. Conf.., 27, Part 1, 533 (1965).
(e) G. Nagy, Proc. IEEE, 56, 836 (1968).
(f) R. F. Ling, Ph.D. Thesis, Yale University (1971).
(g) N. Jardin and R. Gibson, "Mathematical Taxonomy, " John Wiley (1971).
(h) R. Duda and P. Hart, "Pattern Classification and Scene Analysis, " John Wiley (1972).
(I) W. S. Meisel, "Computer-Oriented Approaches to Pattern Recognition," Chapter 8,
Academic Press New York (1972).
19. J. G. Auguston and J. Minker, J. Assoc. Comp.. Mach., 17, 571 (1970).
20. B. R. Kowalski and C. F. Bender, J. Amer. Chem. Soc., 94, 5632 (1972) and 95. 686
21. C. T. Zahn, IEEE, Trans. on Computers, C-20, G8 (1971).
22. J. R. Slagle, C. L. Chang, and S. R. Heller, submitted for publication.
23. S. R. Heller, C. L. Chang, K.C. Chu, Anal. Chem., in press.