Computer Techniques for Interpreting Mass Spectrometry Data

Stephen R. Heller

Division of Computer Research and Technology

National Institute of Health

Bethesda, Maryland 20014 USA

INTRODUCTION

In a previous chapter, Feldmann (1) has discussed and described the structure search component of the Chemical Information System (CIS) being developed at the National Institute of Health (NIH) in the Division of Computer Research and Technology (DCRT). This chapter will concern itself with the details of another component of CIS, the mass spectrometry search (MSS) system. The philosophy behind this work is similar in many respects to Feldmann's structure search. The techniques, methods, and facilities used in the CIS attempt to show the current use and future potential of computers in chemical information handling and processing. As the CIS develops it is hoped that it will provide a useful function to scientists interested in a particular aspect of chemistry and will generate ideas for the solution of more complex problems.

There has been a great deal of research devoted to the computer handling of mass spectral data for the purpose of structure identification. Mass spectra, which can be obtained on as little as 10 grams of material and in quantities of hundreds of spectra hour from a typical gas chromatography-mass spectrometry (GC/MS) computer system, have provided a very large and fertile field for exploring methods for the computer handling of data. Mass spectra can be obtained from a number of different types of instruments. While some variation in intensity may be expected for this reason, the general features of the spectra are independent of the spectrometer. The spectra that have found their way into the computer readable files that will be discussed in the chapters at this Advanced Study Institute (ASI) are virtually all electron ionization (ET) spectra. The spectra are all "low resolution," that is, they are measured to unit mass value rather than to the exact mass of the atom combination. Measuring the exact mass, often erroneously called "high resolution" mass spectrometry because it is usually carried out at high resolution, can distinguish between molecules having N2, CO, and C2H4 groupings whose exact mass increments are 28.0061, 27.994 and 28.031 respectively. A typical EI mass spectrum is shown in Figure 1. The spectrum consists of the intensities of the various ions (isotopes, fragments, rearrangement, molecular ions, etc.) against ion m/e values. The data are usually represented as an x-y bar plot, since the peaks are constant in width.

Figure 1. A Typical EI Mass Spectrum

Four approaches to the interpretation of mass spectra will be presented at his ASI:

1. Library or File Searching

2. Cluster Analysis

3. Pattern Recognition/Learning Machine

4. Artificial Intelligence

The application of all these techniques has been described previously, except for cluster analysis. This chapter will be devoted to discussing the first two techniques, and the following chapters by Jurs (2) and Smith (3) will discuss the latter two methods.

The problem of computer file searching of spectral data has been recognized for some time, in infared (IR) (4) and in mass spectrometry (5). The first attempts to handle mass spectra data by modern computers were done in Sweden by Abrahamsson and Stenhagen in the mid 1960's (6). They developed search techniques based on the five strongest or largest peaks in a spectrum and they further had a provision for using the molecular weight as a key for restricting the number of possible answers from the search. They also developed the first of the so-called "matching" or "similarity" indexes that would rate or rank the similarity of possible answers relative to the unknown. Later workers, pointing out that the strongest or largest peaks in a spectrum were not necessarily the most important for identification, developed a variety of other measures, such as increasing the number of strongest peaks (7). Finally Biemann and coworkers (8) developed the "abbreviated spectrum." This approach selects the n (where n=2, although others use 1 or 3) largest peaks in every 14 atomic mass unit (amu) interval starting at mass 6 since this retains most of the structurally significant peaks in a spectrum because the basic organic building block unit is CH2, i.e., a methylene group. An additional reason they chose this approach was to reduce the storage room required for the spectra. In the case in which n=2, file is reduced to about 24% of its original size.

With the typical double-density disks now available (and laser memories expected shortly), this need for storage limitations becomes much less important. On the other hand their approach did illustrate that an "abbreviated spectrum" could be used for identification. More recently, McLafferty and coworkers (9) have extended the notion of preprocessing mass spectra for library searching using a variety of techniques that the chemist/mass spectroscopist brings to bear upon the problem. In addition to using a simple abbreviated file as Biemann, their Self Training Interpretative and Retrieval (STIRS) system also uses ion series, low and high mass characteristic ions (most abundant odd and even mass ions), small and large neutral losses and secondary neutral losses. One significant problem with this system is that it is a sequential search system and thus inherently slow (in real time) and non-interactive. The fixation for storage compression has led to some very interesting studies by Grotch (10) and Isenhour (11) regarding the intensities of the peaks in a mass spectrum. In a series of papers Grotch describes the compression of intensity information into a few as two bits. Grotch also has worked on compression codes for peaks; such compression undoubtedly does speed up search times.

DCR/CIS MASS SPECTRAL SEARCH (MSS)

The results of these studies, as well as some others, have been well reviewed by Ridley (12). Since these systems have been developed on local computers for local investigators, the main factor that they have locked has been exposure to a wide base of users, such as organic chemist and analytical chemists, in addition to the other mass spectroscopists. The non-accessibility or unavailability of these systems has been the main point to which the NIH system has addressed itself. The various techniques in the literature were assessed, and then we attempted to make the most obviously useful features of existing systems directly available to all scientists with provisions for altering or modifying searches and using other general chemical information that might be at their disposal. In other words, a flexible, interactive conversational mass spectral search and retrieval system was constructed. Table 1 list the options provided in the system. As expected, the peak and intensity searches used predominately. However, if one knows the molecular weights or partial/complete formulae, these may also be used as keys for more efficient searching although useful homologues will be missed. With all these search options the scientist is able to test ideas quickly and to receive virtually immediate responses, which hopefully prod him in to the answers. As opposed to most of the previous systems which arrive at a set of possible answers and stop, the NIH system has provided a variety of options, including a spectrum printout directly from on-line disk storage of the data or microfiche and/or graphics display of spectra for the very important (psychologically and experimentally) ultimate comparison of unknown and possible solutions.

DCRT/CIS

MASS SPECTRAL SEARCH SYSTEM

1. PEAK AND INTENSITY SEARCH

2. MOLECULAR WEIGHT SEARCH

3. MOLECULAR FORMULA SEARCH

a. Complete

b. Imbedded

4. MOLECULAR WEIGHT AND PEAK SEARCH

5. MOLECULAR FORMULA AND PEAK SEARCH

6. MOLECULAR WEIGHT AND MOLECULAR FORMULA SEARCH

7. DISSIMILARITY COMPARISON

8. SPECTRUM PRINTOUT

9. MICROFICHE DISPLAY OF SPECTRUM

10. DISPLAY OF SPECTRUM

11. PLOTTING OF SPECTRUM

12. CRAB - COMMENTS and COMPLAINTS

13. HARVEST - ENTERING NEW DATA

14. NEWS - NEWS OF THE SYSTEM

15. MSDC CLASSIFICATION CODE LIST

Table 1. Mass Spectral Search System Options

Before describing the file structure, the various search options will be outlined. Further details are available elsewhere (13). The backbone of the search system is the peak and intensity search option, which accounts, at present, for over 95% of the searches performed with the system. The file that is being searched in this option is an abbreviated file and consists of the two largest peaks in every 14 m/e intervals, starting at m/e of 6. Quite often this is forgotten by the user, and, for example, peaks at 29, 30 and 31 are entered. The result, of course, is that no spectra in the file are matched, since there are no spectra in the abbreviated file being searched with 3 peaks in that 14 amu interval. On the other hand, m/e 33, 34 and 35 would be quite acceptable. Using the three largest peaks in the 14 amu interval would increase the file size somewhat, but might help those that have a tendency to make the above error. Owing to variations in instruments (e.g. magnetic, electric, quadrupole), it is necessary to have an intensity range factor in the peak search. By allowing the factor to be varied by the user, he can "get a feeling" for the file and how intensities do or do not affect a search for his particular class of compounds. This is especially necessary because of out group of users. It has been found that very effective searching can be done with a very few (two to four) well-chosen peaks, almost independent of the intensity range decided upon. Figure 2 is a typical peak and intensity search. The opportunity for a chemist to interact with the system (and see how a search is proceeding) is, we feel, fundamental in the success of this system, Feldmann's structure search and the synthesis programs of Corey (14) and Wipke (15).

The search option used next most often is the molecular weight (MW) search which simply provides a list of names, molecular formulae, and identification numbers (ID#'s used for other options within the system) for those compounds in the file with the specified MW. This option is particularly useful in Chemical Ionization (CI), Field Ionization (FI) and Field Desorption (FD) spectrometry where the molecular weight is easily deduced.

Figure 2. Example of a Peak Search

Figure 3 shows an example of a MW search for a molecular weight of 103. Another search option used almost as often as the MW search is the molecular formula (MF) search option, consisting of two parts. The complete MF part allows one to search for a complete molecular formula and provides the same information output as the MW search. Figure 4 shows a typical complete MF search. The imbedded MF search allows one to search for spectra in the file using only part of the molecular formula. This option uses the individual atom type and its magnitude in the molecule. Thus, one can search for Br2 or C6, but not (as yet) for any Br compounds up to and including Br2 or any number of carbon atoms up to or including C6. Figure 5 shows a search for the same formula as shown in Figure 4, except by allowing the formula to be part of a larger molecule, more answers are found. These three searches (peak/intensity, MW and MF) have also been combined to allow for a molecular formula and molecular weight search, a peak and molecular formula search and a peak and molecular weight search. These combined searches are extremely effective in taking a minimal amount of information and reducing the possibilities in the file to usually less than a half dozen with just a MW or MF and a peak.

TYPE MOLECULAR WEIGHT

CR TO EXIT

USER: 103

PROGRAM: FOUND 8 SPECTRA WITH MW: 103

PROGRAM: PRODUCE REFERENCES YES OR NO?

USER: YES

Figure 3. Example of a Molecular Weight Search

Figure 4. Example of a Complete Molecular Formula Search

Figure 5. Example of an Imbedded Molecular Formula Search

Figures 6 and 7 are examples of two of these combined search options. In the future, when the data is available, it is planned to add to the system Feldmann's nested structure searching, the Mass Spectrometry Data Centre's Classification Code scheme, and the references to the spectra (e.g. type of instrument, conditions, literature references).

Figure 6. Example of a Peak and Formula Search

Figure 7. Example of a Peak and Molecular Weight Search

Once the user has a possible answer, he has the option of quantitatively comparing it with his unknown using the dissimilarity index (DI) option. This option takes the intensity of a file compound and the unknown at each m/e value from 12-500 (or any smaller user-specified range) and makes a comparison derived from a Euclidean geometry distance measurement. The DI in the NIH system also allows for comparison of two files spectra, which is very useful in checking for duplicates in the file. While this is an objective measure for comparison, mass spectrometry structure identification is still, we feel, basically subjective. Thus, we have provided a number of options that allow the user to judge for himself the quality of his answer. The first and simplest is the spectrum printout option, which simply types out all (or just a portion, if so specified) of the peaks and intensities in a spectrum.

Figure 8. Example of a Spectrum Printout

Figure 8 shows a typical spectrum printout. Alternatively, with a graphics terminal, such as the GT-40 or Tektronix 4010 being demonstrated at this ASI, the display option plots the data in a standard x-y bar plot. Lastly, the plot option allows the same data that has been on the display to be plotted on either a Caclomp or Zeta Plotter. The mass spectrum plot in Figure 1 was obtained using this option. The data base has also processed by a COM (Computer Output Microfiche) devise and it is available on microfiche, either using a manual or an automatic (and computer drive) viewer. Figure 9 shows the computer driven microfiche viewer next to the terminal user to drive the viewer. The image on the screen is one of 192 spectra on each fiche. The carousel in which the fiche reside inside the viewer requires a four second maximum access time for viewing. In addition to these options, there are miscellaneous options for entering data, commenting about the system and receiving news of the system. Lastly, a fragment loss search option is currently being added, which will allow for the searching of neutral loss from-1 (i.e. the parent ion (P) +1 peak) to P-100. Preliminary work in using this feature as an additional tool for structure elucidation has been very promising.

Figure 9. Computer Driven Microfiche Viewer

FILE STRUCTURE

Now that the system has been outlined, some description of the file structure is in order so that one may understand how and why the system operates. The specific examples are for a DEC PDP-10 computer, but the principles are applicable to any modern timesharing computer system. In fact, the system has recently been transferred to the GE-635 computer with a minor number of problems. Previous workers have considered that the search and retrieval system involved two problem areas--the representation of the data and the search algorithm. Actually, there is only one problem; the data representation and search algorithm are virtually indistinguishable and inseparable--if they are to be good. The file structure used in the mass spectral search is similar to Feldmann's file structure in that the method of searching is dictated by the file structure and is the reason for the interactive capability.

All the files in the mass spectral search system can be viewed as one of two types--one containing numeric search data (peaks, molecular weights, etc.) and one containing character search data (molecular formulae). If the data is binary or ASCII, packed or unpacked, these formats are just minor modifications of the basic file structure.

The abbreviated peak file consists of the identification number (ID#), peak, intensity, and molecular weight (MW), sorted by peak and ID# respectively. The file generation program takes this file and creates a pointer file which indicates whether or not there are any peaks at that mass and if so, where in the reference file the ID#, intensity, and MW are stored. The ID#, intensity and MW are all packed into one 36-bit word, 18 bits for the ID# (allowing for 2 or 262,143 spectra in the file), 7 bits for the intensity (allowing for the usual 0-100% intensity range), and 11 bits for the molecular weight (allowing for a MW up to 2048). When a search is performed, the value of peak is used to go to that word or cell in one of the 128-word blocks of the pointer disk file and if a non -1 number is found (-1 indicating no reference for that molecular weight), that number is the starting point for the ID#, intensity, and MW in the reference file. (The 128-word size for the block is for the PDP-10 computer disk block size. The GE-635 computer uses a 315 work block and so the file on the GE system contains the same information, but located in different blocks). The program then simply looks up this starting point (subtracting it from the starting point of the next mass will give the number of references for this mass), unpacks the intensity, and checks to see if it is in the allowable range. If it is, then the ID# is written out onto a scratch disk file to be used later for further search comparisons or to look up the name associated with the ID#. The molecular weight could readily be made a search parameter by allowing for a MW range and unpacking here and checking it in addition to the intensity. Now, when there is a second, third (and so on) search, the same basic look-up is done and a Boolean AND logic operation is performed between two lists of ID#'s. Figure 3 shows a typical MW of 103. Tables 2 and 3 are the pointer and reference files respectively for this MW search. These files were used since they are ASCII, rather than binary (which is the case for the peak reference, molecular formula pointer, and name reference files). The MW of 103 is used as a pointer to the 103rd word or cell in the 128-word block, which contains the pointer number 1563. Subtracting this form the 104th word -1, gives 1572-1563-1=8, which is the answer given in Figure 3. One then goes to block 1563/128 + 1=13 (with a remainder of 27) in the reference file and the 27th word/cell is, indeed, 809. The module function was used to calculate this value. The following seven numbers are the ID numbers shown in Figure 3. These ID# are then used in the identical name disk file look-up to go to the name pointer and reference files. One first uses the ID# to locate the starting point and length (by subtraction from the starting point of the next sequential ID#) of the name. The name, stored as five 7-bit binary characters per 36-bit word, is then printed out. The spectrum printout option works in the same manner, but in this case, the peak and intensity are stored in binary form, 18 bits each in one 36-bit word.

In the case of molecular formulae (MF), a different approach is necessary. Molecular formulae are character strings and do not have any particular numerical value associated with them. Thus, to be handled in the computer, it is necessary to convert the molecular formula string into a number or location in the disk pointer file. One method to use in "converting" a molecular formula to numeric key pointer value is called hash coding (16). In this procedure one processes the molecular formula character string through a function which produces a number. Clearly the function must be reasonably well chosen so that two different strings will rarely produce the same number and when such a "collision" occurs it can be handled with a minimum of difficulty. Now that one has converted the molecular formula string into a numeric key-value, this number is used to address the appropriate pointer file block. In this case, the block is divided into two-word cells as opposed to one-word cells for the previous file structure. This is necessary because one cannot use the hashed coded numeric key valued as both and indicator to the pointer file and an indicator to the reference file. The second word in each cell of the pointer file is used as the pointer to the starting position of the ID#'s in the molecular formula reference files. Tables 4 and 5 are examples of the molecular formula pointer and reference files. Tables 4 and 5 are examples of the molecular formula pointer and reference files. If one takes the first molecular formula in the file, Ar (for the inert gas Argon) and processes it through the hash function, the result is 4682565. The number, 4682565 is found in block 14 of the pointer file, cell #7 (word #13). In the second part of that cell is the starting address position of the references and the number of references to Argon are contained in the first word of the reference file. Indeed, the second word is 1, which is where the first reference would be expected to be and Table 4 shows that the first word of Block 1 is 5, the number of Argon references plus 1. The next four words are ID#'s for the Argon spectra. The reason for the number 5, instead of 4 is to take into account the space (i.e. one extra word) for the number of references. This, then is how the complete and partial molecular formula searches are performed.

In summary, the file structure attempts to maximize the use of the direct access disk and the 36-bit word length of the PDP-10. If one had to use a computer with a different word size, the system would, of course, still work, but would require some minor changes in the sizes of the cells to pack the peak, intensity and molecular weight.

Interpreting Mass Spectrometry Data

Table 2. Molecular Weight Pointer File

Table 3. Molecular Weight Reference File

This question of using smaller machines (that is, the so-called "mini-computer") is worth looking at from the standpoints of ease of use, value, or utility and its probable role in the future. At present, mini's are used for lab automation and often to perform numerous other tasks, but the direction in which technology is heading is towards mini-processors, which are cheaper and better to use. While search systems utilizing mini-computers have been well designed and might be viewed at first glace as a very effective alternative to the central search system described, here a few points are worth noting. First, the great variety of mini-computers with incompatible software makes use of available programs difficult. Second, assuming one did have such a mini computer available in his laboratory, it could be used at one time either for data acquisition or data analysis - not for both. As more instrumentation is computerized, there will probably be less time on a mini in the lab available for analysis/retrieval.

Interpreting Mass Spectrometry Data

Table 4. Molecular Formula Pointer File

Table 5. Molecular Formula Reference File

In addition to the problem of getting time on the lab computer, there is the initial expense of the data base, the time/cost for processing it for the search program. Then there is the time/cost for correcting errors in the data and then reprocessing the data base. Along with this, there is the time/cost of obtaining new data and processing it for the file.

Let us now assume that the potential routine users of such a system have been satisfied that a central system is the right direction in which to go, considering the time, costs, and lack of manpower expertise to perform the task. What of the researcher, the scientist who would like to experiment with the data base, either for alternative retrial methods or other types of data analysis (e.g., cluster analysis or pattern recognition)? At no cost to him, he has a very large data base available on-line and one that someone else will maintain, correct and augment. With a versatile search system or a program of his own, he may extract those parts of the data base of interest to his research and work with only that portion. Having a large and powerful computer system at his disposal, it is much easier to write in a higher level programming language and easier to perform experiments. Thus, the central search facility approach appears to offer a valuable service to the researcher also.

LIBRARY SEARCH SYSTEM SUMMARY

As Grotch (10), Isenhour (11), and others have stated, the most desirable attributes of a search system are generally thought to be:

1. Ease of use

2. Effectiveness

3. Availability

4. Low cost

The MSS system fits these four criteria very well. Its ease of use and availability are evident from its use (25 sessions per day) and the size of the user community (over 200 users). Its effectiveness or efficiency is good and is essentially the same as most other systems'. Lastly, its overall low cost is also quite obvious. These attributes led to the system's being transferred to the GE network. A demonstration of the search system described here will be available to ASI participants from the GE node on the network here in The Hague, The Netherlands. The management of and responsibility for the system is now being handle by the Mass Epec Data Centre (MSDC) in Aldermaston, England. In this way the value and use of the system to the world-wide mass spectrometry community can be tested and evaluated and the data base made available to research workers in the field.

Undoubtedly these concepts will take time to develop and evolve. Time-sharing systems are still in their infancy (17). At present, there is only one large international network (the GE/Honeywell network), and it goes only from Tokyo to Rome, leaving a large segment of the world outside the system. A few other companies are just starting to provide international networks, while most are regional (i.e., -serve a dozen large states and a few dozen large cities in the U.S. or perhaps a few cities in England). One major goal of this presentation is to show that research efforts devoted to developing systems (programs, data bases, etc.) that can readily and easily be made available over telephone lines to computer networks are of value to the scientific community. Using such a facility will enable more research time and effort to be devoted to the actual research rather than the presentations, program rewriting, data base reformatting, etc., that are so prevalent today.

CLUSTER ANALYSIS

An example of how this large file of mass spectral data being stored continuously on-line can be used for research will now be illustrated using a technique called cluster analysis which is usually considered a part of pattern recognition. Cluster analysis is concerned with finding homogeneous sets of data points within a larger set of data. Within cluster analysis there are a number of approaches. Only one will be used here. Further details are available in a number of books and articles (18). Cluster analysis is a method that classifies, associates or clusters a set of data into a number of natural sets containing similar data. It differs, generally speaking, from the pattern recognition techniques to be described later by Jurs, in that cluster analysis is usually non-supervised learning and not limited (i.e., the number of clusters are not known). Pattern recognition methods require pre-classification or pre-labeling of the data and a predetermined number of classes into which the data can fall. However, cluster analysis allows the data to fall into whatever classes the particular clustering techniques generate; i.e., the chips fall where they may. It should be pointed out that there are, however, situations in which the cluster analysis is unsupervised but limited. By this it is meant that while the data points are un-labeled, the number of classes into which they may fall is limited. This area will not be dealt with here, but rather an example of using the unlimited and non-supervised mode will be presented since it has a great dealt of potential value in chemistry in that it may enable the scientist to gain insight into the nature of the data he is analyzing. Such approaches have found value in analyzing data for taxonomy classification, economics, geography, and automatic generation of the thesauri (19). Clustering opens up the possibility of discovering new facts about old data or puzzling data.

There are a number of clustering techniques available. Kowalski and Bender have discussed the Q-mode clustering method and non-linear mapping (NLM) as two techniques for separating data where there are only a few clusters (20). This latter technique reduces the n-dimensional data to 2 dimensions so that it may be visualized as x, y points and was used to perform the recognition of the classes defined by the Q-mode method. The NLM method necessarily introduces some distortion into the final picture/display in two dimensions, but the guiding principal behind the mapping dictates that the close distances are preserved, and this in turn preserves the clustering of the data. NLM suffers from the disadvantage that the mapping produced can often be too dense for a good display. Another type of cluster analysis procedure is the use of graph theory methods and Zahn (12), in a comprehensive review of this subject, describes a number of algorithms for clustering based on the so-called "minimal spanning tree" (MST) techniques.

Data point sets come in a variety of patterns. Figure 10-12 show three sample problem data sets that cluster analysis has been used to solve. Figure 10 is the trivial case of two distinct clusters which can be readily separated. However, if one looks closely, on of the two sets of ten points can probably be separated into two smaller sets of five points. Figure 11 shows a more complicated case of a set of points attached via a short "neck" and is more difficult to separate. Lastly, Figure 12 shows a possible mass spectral data problem. In Figure 12 one can readily find anywhere from five to eleven (or more) clusters, depending on what criteria one uses for "similarity." Ball (18) has discussed a number of measures of similarity. They include the dot product, similarity radio, weighted and unweighted Euclidean distances and weighted and unweighted Boolean "AND" normalized correlation methods.

Interpreting Mass Spectrometry Data

Figure 10. Two Separate Clusters

Interpreting Mass Spectrometry Data

Figure 11. Two Clusters Joined at a Neck

Figure 12. Multi-Cluster Pattern

Figure 13 illustrates one of the graph-theory methods, the shortest spanning path (SSP) cluster procedure (22). A simple distance measurement to find the two clusters in the data set (steps A-D). In part A, one has the starting point set. One then connects all pairs of points as shown in part B. In step C one obtains the SSP. In part D, the "inconsistant" edge J-K has been deleted since it has a larger distance weight than any other neighboring edges and hence the data points are clustered. If one now considers the total weight of edges in a graph as the sum of all the constituent edges, the minimum or shortest spanning path (SSP) in this graph is a spanning path whose weight is a minimum among all the spanning paths of this graph.

Figure 13. SSP (Shortest Spanning Path) Cluster Procedure

The SSP method can also be used in data reorganization, that is, the data can be rearranged according to the "shortest spanning path" for the data. We have used the SSP method to rearrange parts of the mass spectrometry data base (23). It has been used because it is available, is able to handle large amounts of data, and allows for a convenient presentation of the clusteral data. No comparisons with other clustering techniques have been made using this mass spectral data, so that the relative efficiency of the SSP method cannot be indicated. Since finding an absolute minimum path with a large number of data points is time consuming, the SSP method uses an approximation technique which finds a relative minimum path. This procedure is a far less time consuming process and is able to handle large amounts of data. One very serious limitation to most previous clustering methods has been their inability to handle realistic amounts of real experimental data. While the SSP does have a favorable side to it, one obvious potential problem with the SSP method is shown in Figure 14, which is very similar to the data point samples in Figure 11. Anyone looking at this data will readily conclude there are two clusters, but the closeness of the sum of the point near the neck of the two clusters and the SSP technique of finding the closest neighbors from any given pairs of points would lead to the path shown in the diagram. If one labels one cluster A and the other B, then the resulting reorganized data will not be A and B, but rather BAB. Thus, one must be cautious in using the results directly and automatically.

Figure 14. The SSP Through a Set of Clusters

The compounds used in the sample clustering experiment were molecules containing one sulfur atom and any other element (in any combination and/or amounts) taken from the MSS data base. Using the imbedded molecular formula search for compounds with one sulfur atom, 525 answers were found. Because some of spectra do not have peaks starting at mass 26, these were discarded and 323 remained. Duplicates were not removed. The data was first used as is, without any weighting of the intensities. The peaks from m/e=14 to 140 and losses from 0 to -99 (i.e. Parent to Parent -99 amu) were considered as features for a total of 227 features. The losses were calculated from the known molecular weights of the compounds. The data was later weighted since it was found to produce better clustering. The weighting scale assigned a value of 1 to all peaks and losses with an intensity less than 10%, 2 to all peaks greater than 10%, and 4 to all losses greater than 10%. The matrix output consists of the sorted list of 323 rows of sulfur compounds (sorted according to the order of the SSP) and the 227 columns of features used to classify them. The computer program used has the ability to "sort" the 227 features as well as the compounds, but this feature analysis was discarded owing to the computer time necessary to perform this clustering and the relative lack of information that it seems to impart at this early stage of analysis. Without the sort feature, the SSP program required 86K words and about 30 minutes of cpu time (on a PDP-10) to cluster the 323 sulfur compounds. Work is underway to reduce both these factors.

We have analyzed the resulting matrix output, defining clusters partly based on the results of the SSP sorted list and partly on chemical intuition. This was done because the potential problem in Figure 14 appeared to occur in this case. Table 6 contains part of the sorted list of a sample run showing part of the sulfur cluster. It is fairly clear that some groups exist, such as: sulfur-halogen, sulfur-phosphorus, small compound, C2 sulfur compounds, etc. Table 7 contains a partial list of the clusters that have been tentatively identified from this method. These identifications were made with relatively little difficulty because of the matrix output representation that the SSP method produces. This output, which indicates which features (peaks, losses) are used for the clustering, distinguishes the SSP method from most other clustering techniques and is a value asset in the analysis. The true test of this technique will come as more data is made available and tested based on the results found from this small class of sulfur data.

Interpreting Mass Spectrometry Data

Table 6. Part of the Reorganized List of Sulfur Compounds

CLUSTER ANALYSIS CLASSIFICATIONS

1. SULFONES

2. THIAZOLES, BENZOTHIAZOLES

3. AROMATIC SULFUR

4. SULFUR-HALOGEN

5. SULFUR-PHOSPHORUS

6. SULFITES AND SULFATES

7. BICYCLIC SULFUR

8. THIOL ESTERS

9. ISOTHIOCYANATES, THIOCYANATES

10. C6-C8 THIOLS, SULFIDES

11. C5-C6 CYCLIC SATURATED THIOLS, SULFIDES

Table 7. Partial List of Clustered Groups

However, even now, the peaks and losses that the SSP clustering technique has shown to be characteristic for a group such as the monofunctional straight-chain alkyl thiol esters (RCOSR) have shown that this approach offers a great deal of potential for interpretating mass spectral data. This approach of recognition and classification of (functional) groups may lead to new empirical rules for structural elucidation.

Acknowledgments

I wish to express my deep appreciation to Dr. Henry M. Fales for his kindness and help with this work and his support of the search system. I also wish to thank Dr. Arnold W. Pratt, Director, Division of Computer Research and Technology for his foresight, continued support and interest in the computer applications work presented here.

REFERENCES

1. R. J. Feldmann, Chapter 3, "Proceedings of the NATO/CNAASI on computer Representation and Manipulation of Chemical Information" edited by W. T. Wipke, S. R. Heller, R. J. Feldmann, and E. Hyde, John Wiley, New York (1973).

2. P. C. Jurs, Ibid, Chapter 11

3. D. Smith, Ibid, Chapter 12

4. (a) D. H. Anderson and G. L. Covert, Anal. Chem., 39, 39, 1288 (1967).

(b) D. S. Erley, Ibid, 40, 894 (1968).

(c,d) F. E. Lytle, Ibid, 42, 355, 1532 (1970).

(e) P. C. Jurs, Ibid, 43, 364, (1971).

(f) D. S. Erley, Appl.. Spec., 25, 200 (1971).

5. P. D. Zemany, Anal. Chem., 22, 950 (1950).

6. (a) S. Abrahamsson, S. Stenhagen and E. Stenhagen, Biochem. J., 92, (1964).

(b) S. Abrahamsson, Science Tools, 14, 129 (1967).

7. (a,b) V. L. . Talrose, V. V. Baznikow and G. D. Tantsyrev, Dokl, Akad. Nauk SSR, 159, 182 (1964); 170, 379, (1996).

( ) Petterson and R. Rayhage, Arkly. Kemi, 26, 293 (1967).

(d) B. A. Knock, I. C. Smith, D. E. Wright and R. G. Ridley, Ibid, 42, 1516 (1970).

(e) L. R. Crowford and J. D. Morrison, Anal. Chem., 40, 1464 (1968).

8. H. S. Herz, R. A. Hites, and K. Biemann, Anal. Chem., 43, 681 (1971).

9. (a) F. W. McLafferty, Pure Appl.. Chem., 7, 61 (1971).

(b) K. Kwok, R. Venkataraghavan and F. W. McLafferty, J. Amer. Chem. Soc., 95,

4185 (1973).

10. (a) S. L. Grotch, Anal. Chem., 42, (1970).

(b) S. L. Grotch, Ibid, 43, 1362 (1971).

11. L. E. Wangen, W. S. Woodward, and T. L. Isenhour, Ibid, 43, 1605 (1971).

12. R. G. Ridley, Chapter 6, "Compound Identification by Computer Matching Mass Spectrometry," edited by G. R. Waller, John Wiley, (1971), and references therein.

13. (a) S. R. Heller, Anal. Chem., 44, 1951 (1972).

(b) S. R. Heller, H. M. Fales, and G. W. A.Milne, J. Chem. ed., 49, 725 (1972).

(d) S. R. Heller, D. A. Koniver, H. M. Fales, and G. W. A. Milne, Anal. Chem., in press.

(e) S. R. Heller, R. J. Feldmann, H. M. Fales, and G. W. A. Milne, J. Chem. Doc.,

in press.

14. (a) E. J. Corey and W. T. Wipke, Science, 166, 178 (1969)

(b) E. J. Corey, Quarterly Rev. 25, 455 (1971).

(c,d) E. J. Corey, W. T. Wipke, R. D. Cramer, and W. J. Howe, J. Amer. Chem, Soc., 94,

421, 431 (1972).

(e) E. J. Corey, R. D. Cramer and W. J. Howe, Ibid, 94, 440 (1972).

(f) E. J. Corey and G. A. Petterson, Ibid, 94, 460 (1972).

15. W. T. Wipke, Chapter 7, "Proceedings of the NATO/CNA ASI on computer Representation and Manipulation of Chemical Information" edited by W. T. Wipke, S. R. Heller, R. J. Feldmann, and E. Hyde, John Wiley, New York (1973).

16. R. Morris, Comm.. ACM., 11, 34 (1968).

17. Barrons, "Time Sharing Marches On," Barrons, January 15, 1973, p. 3.

18. (a) R. C. Tryson, "Cluster Analysis, " Edwards Bros. Inc. Ann Arbor, Michigan (1939).

(b) R. R. Sokal and P. H. A. Sheath, "Principles of Numerical Taxonomy, " Freeman, San Francisco (1963).

(d) G. H. Ball, Proc. AFIPS 1965 Fall Joint Comp.. Conf.., 27, Part 1, 533 (1965).

(e) G. Nagy, Proc. IEEE, 56, 836 (1968).

(f) R. F. Ling, Ph.D. Thesis, Yale University (1971).

(g) N. Jardin and R. Gibson, "Mathematical Taxonomy, " John Wiley (1971).

(h) R. Duda and P. Hart, "Pattern Classification and Scene Analysis, " John Wiley (1972).

(I) W. S. Meisel, "Computer-Oriented Approaches to Pattern Recognition," Chapter 8,

Academic Press New York (1972).

19. J. G. Auguston and J. Minker, J. Assoc. Comp.. Mach., 17, 571 (1970).

20. B. R. Kowalski and C. F. Bender, J. Amer. Chem. Soc., 94, 5632 (1972) and 95. 686

(1973).

21. C. T. Zahn, IEEE, Trans. on Computers, C-20, G8 (1971).

22. J. R. Slagle, C. L. Chang, and S. R. Heller, submitted for publication.

23. S. R. Heller, C. L. Chang, K.C. Chu, Anal. Chem., in press.