THE ROLE OF PATTERN RECOGNITION IN THE COMPUTER-AIDED CLASSIFICATION OF MASS SPECTRA
WILLIAM S. MEISEL and MATT JOLLEY
Technology Services Corporation,
Santa Monica, CA 90403 (U.S.A.)
STEPHEN R. HELLER*
Environmental Protection Agency,
PM-218, Washington, DC 20460 (U.S.A.)
GEORGE W. A. MILNE
Bethesda, MD 20205 (U.S.A.)
(Received 28th June 1979)
The requirements for the use of pattern recognition techniques as an aid in the identification of chemical substances from their mass spectra are reviewed. Decision-tree pattern recognition is recommended as potentially satisfying these requirements. Examples of this approach using a large data base of mass spectra are provided.
In recent years, there has been considerable activity in the area of identification of unknowns from their spectral features. Mass spectrometry, infrared, proton and carbon-13 nuclear magnetic resonance have been the primary tools used . Early approaches involved the establishing of as large a library of spectral data as possible and use of one of a number of algorithms -to compare an unknown with the contents of the library . Later, combined library searching, based on combinations of spectral data was introduced , and finally, as libraries grew larger and comparison algorithms seemed to have reached their limits of usefulness, pattern recognition techniques appeared .
While all these approaches have virtues based on practical needs, they also all leave much to be desired. In particular, as libraries grow bigger, search times (costs) also grow; and pattern recognition techniques, for all their sophistication, have rarely been able to match the accuracy of simple library-search algorithms in the area of structure elucidation. Classification to the level of a particular chemical structure will probably continue to be best performed by a library search, provided that the particular structure is in the library.
The role that is envisioned for pattern recognition techniques in structure elucidation by mass spectrometry is twofold. First, pattern recognition may be used as a supplement to library search to signal a possible error in the final match. For example, if pattern classification algorithms indicate a lack of nitrogen in the compound with high confidence, and the library search matches with a nitrogen-containing compound, then the results must be viewed with caution. Secondly, pattern recognition techniques can aid identification of an unknown when it is not in the library by indicating the probability of the presence or absence of a certain chemical structure or substructure. This information can guide classical approaches based on experience or can be used as a constraint for a computer program which generates alternative structures .
The form of the results of such a pattern recognition algorithm, which uses a mass spectrum as input data, is a list of potential characteristics of the apposite functional groups (e.g. "nitrogen present", "aromatic rings present", etc.) along with a measure of the probability that the given characteristic applies to the unknown chemical. The probability that a characteristic is present can be stated both in relative and absolute terms; i.e. one can indicate the probability that the characteristic is present, either by including consideration of the frequency with which the characteristic occurs among representative chemicals or by ignoring such a consideration. In addition to an alphabetical list of characteristics, the characteristics should be listed with the most probable characteristics first and the least probable last. The ranked or ordered list gives a quick view of the most likely and least likely characteristics of the unknown.
In view of these aims, a classification algorithm should have four features. First, the algorithm should yield a measure of the probability that a characteristic is present, not simply a yes/no "best guess". Many classical approaches to pattern recognition simply yield an unequivocal classification of the unknown, with no indication as to the confidence in that classification. Secondly, the operational algorithm should be computationally efficient to minimize expense and to allow a large number of characteristics to be tested. (This does not require that the derivation of the algorithm be computationally efficient. Often the accuracy and efficiency of the operational algorithm are directly related to the effort in deriving the algorithm.) Thirdly, the algorithm should allow significant discrimination among classes, i.e., it should classify accurately when possible. An algorithm which always indicated that a characteristic has a 50:50 chance of being present would be of no utility. Fourthly, the algorithm should, to some degree, allow conditional use of some chemical measurements which may not always be available. For example, if the molecular weight of an unknown is known, classification can often be considerably more accurate than when it is not available. Similarly, if some part of the spectrum were omitted (e.g. low m/z values), it would be useful if some estimate of the probabilities were still possible.
THE DECISION-TREE APPROACH
Recent developments in pattern recognition research provide an approach which satisfies these demands--a decision-tree approach. A very simple decision tree is illustrated in Fig. 1. In this tree, which will be fully discussed later in this paper, the molecular weight (MW) of the unknown is assumed to be available. The tree is designed to identify the likelihood of the feature "chlorine atom(s) present"; class 1 is "chlorine present" and class 2 is "no chlorine present". The numbers used in Fig. 1 are valid; they were obtained by applying the trees to spectra that were not used in its creation.
Fig. 1. Decision tree to indicate the likelihood of the feature "chlorine atoms present".
A decision tree consists of a series of decision nodes connected by branches. In order to classify a given spectrum, the decision tree is used in a sequential manner. In Fig. 1, a spectrum would first be examined to see if a loss of 35 a.m.u. was present. (Loss is defined as from the assumed molecular weight.) Depending on the answer, different aspects of the spectrum would be examined. If "yes", a loss of 46 is sought in the spectrum; if "no", it is necessary to determine whether the highest m/z value present in the spectrum is more than two m/z units from the MW. In either case, the tree is followed along branches determined by the spectrum until a terminal node is reached at the bottom of the tree. At the terminal node, the relative likelihood of the feature being present in the molecule (in this case chlorine atom(s)) is given.
The relative likelihood is expressed as follows. The first two numbers indicated at each terminal node in Fig. 1 are the percentages of each class which would reach that terminal node. For the terminal node 8, 14.9% of class 1 and 1.4% of class 2 have the features corresponding to that node (a loss of 35, a loss of 46, and a peak at 35 with greater absolute abundance than 0.02 relative to the base peak of the spectrum). Thus, if compounds were chosen from an environment where chlorine was as likely to be present as not, for the node the odds would be 10.6 to 1, i.e., 14.9/1.4, that chlorine was present. This ratio is listed at each terminal node in Fig. 1. Assumptions as to the relative abundance of chlorine in the environment, i.e., the a priori probability of chlorine being present in the compound (irrespective of the spectrum), can be incorporated into this measurement of likelihood if deemed appropriate. The use of a priori probabilities is discussed below. The immediate discussion of ratios assumes equal a priori probabilities of each class. When the ratio is less than 1, as in node 9, the percentage of class 2 compounds reaching the node is greater. Thus, for that node, the odds are 1.9 to 1 against the presence of chlorineIn a classical pattern recognition approach, the terminal nodes at the bottom of the tree would simply be consigned to class 1 if the ratio exceeded 1, and to class 2 if not. From this classical viewpoint, nodes 5 and 7 achieve no purpose, as after the division indicated by those nodes, the classification (in the separation of classes sense) is the same irrespective of the outcome of that decision node. Thus, the present purpose of these nodes is to refine the estimate of confidence in the decision. It is particularly attractive to have very clear decisions such as in node 10, where the ratio is 40; here, one can be very confident that chlorine is present.
This particular small tree is intended to illustrate the decision-tree approach, not to reflect the full scope of the approach. A larger tree can use more characteristics of the spectrum and can be considerably more accurate. Further the aspects of the spectrum considered could be considerably more complex than those illustrated. This tree is consistent with the objectives set earlier for a decision-aiding scheme based on pattern recognition. First, the approach yields an estimate of the relative probability with which the feature is present. Secondly, the decision tree is computationally efficient. The features are calculated only when required. In the tree illustrated, no more than three decision nodes are required to evaluate any given spectrum, although there are a great deal more decision nodes which are potentially usable. Thirdly, the algorithm has potential for very accurate classification, as the types of features actually used in the decision nodes can be specific to the chemical under consideration. More aspects of the spectrum can be considered in the decision tree without running into limitations as to the number of example spectra, because in classifying any given compound, only a few of these aspects of the spectra are considered in making the decision. Thus, this form of the decision rule tends to make maximal use of the data available.
Fourthly, the algorithm allows for conditional use of features of the chemical spectrum. If a node down the tree requires information that is not available, the procedure can terminate at that point and still indicate the likelihood of the feature being present. As an example, consider node 5. If the information required to make the decision at node 5 were for some reason not available, the decision could be made on the percentages prior to node 5. Since there are 46.1% of class 1 and 2.6% of class 2 prior to that node, the ratio would then be about 18 to 1 (i.e., 46.1/2.6).
This form of classification algorithm is easily interpreted, in contrast to many other approaches. This is a great advantage in generating confidence in workers who are unconvinced of the value of pattern recognition techniques. This aspect of the decision-tree approach also allows the results to be used as a teaching aid in the classification of mass spectra.
THE USE OF A PRIORI PROBABILITIES
The a priori probability of a chemical characteristic is the frequency with which that characteristic occurs in the universe of chemicals under test. For example, if chlorine is present in 25% of the chemicals tested, the a priori probability of chlorine present is 0.25.
In discussing the decision tree of Fig. 1, the ratios at terminal nodes were treated as if the two classes "chlorine present" and "chlorine absent" had equal a priori probabilities. Thus, a ratio of 10 at a terminal node implies odds of 10:1 that chlorine is present only if the a priori probabilities are equal. If, for example, the a priori probability of chlorine being present was only 0.1, i.e., 1:10, then the implication of a ratio of 10:1 at a terminal node is that there is a 50% likelihood of the sample containing chlorine.
Suppose that a characteristic (e.g., "chlorine present") has the a priori probability P of occurring (and thus, the probability 1--P of not occurring). Then let the a priori ratio r(O) be defined as: r(O) = P/(1--P). If P is 1:11, then r(O) = 0.1, and the odds are 1:10 that any given compound will contain chlorine, irrespective of any consideration of the mass spectrum. The ratio R' for a terminal node, taking into account the a priori probabilities, is the ratio R calculated by assuming equal a priori probabilities, multiplied by r(O):
R'= R* r(O)
Thus, in this example, r(O) = 0.1 and R = 10, yielding R' = 1 (i.e., equal odds that chlorine is present or absent, considering both the a priori probability and the mass spectrum).
This provides a means of including the a priori probability in determining the ratio which allows assessment of the likelihood of a given feature being present. This raises two questions, however: when is it appropriate to use the a priori probabilities and how are they to be determined?
The latter question will be discussed first. The decision as to the appropriate a priori probabilities depends on the universe of chemicals from which the unknown is drawn. For example, if the compounds in the MSSS  are representative, the frequency with which the feature appears in that data base could be used. However, that universe of chemicals may not be representative of those from which the unknown was drawn in any particular application. It may, in fact, be quite difficult to decide what the appropriate a priori probability should be. To put this issue in perspective, the purpose of a given decision tree is to provide a comparative evaluation of the likelihood of one feature being present in contrast to other characteristics. Thus, what is sought is the a priori probability which reflects the relative likelihood of a given feature being present. If it can be assessed that, for example, nitrogen is twice as likely to be present as chlorine on an a priori basis, this can be reflected in the a priori probabilities.
The appropriate use of the a priori probabilities can now be discussed on the assumption that there is an acceptable assessment of these probabilities. Suppose that there is a feature that occurs relatively infrequently compared with other features (e.g., the compound is a polychlorinated biphenyl). If one simply looked at a list of adjusted ratios, i.e., those which incorporate the a priori probabilities based on eqn. (1), one would probably never consider that low-probability feature because the adjusted ratio, R', would be small, even when the unadjusted ratio, R. was relatively large. When R is large, it suggests that the unknown has a much larger chance than usual of having the feature in question. Similarly, for a common feature of a mass spectrum, a relatively high value of R' can result when R is less than 1, i.e., when the feature is less likely than usual to be present. Therefore it is suggested that a system for aiding in the classification of an unknown from its mass spectrum should indicate the unadjusted ratio, R. irrespective of the a priori probabilities.
If a priori probabilities are ignored entirely, then one may spend an inordinate amount of effort in considering relatively unlikely features simply because they are more likely than usual. Thus, it is also proper to consider R' in judging which features to explore. As a simple guideline, the features (chlorine present, aromatic ring present, etc.) can be ranked by the estimated R and R' separately. Then a particular feature will have a rank by each criterion; for example, "chlorine present" may be ranked the second most likely feature by R and the third most likely feature by R' The features can then be examined in order of their average rank (e.g., "chlorine present" would have an average rank of 2.5).
This approach to the use of pattern recognition in classifying mass spectra depends on having an algorithm which yields a measure of the likelihood of a characteristic which is comparable to a similar measure of likelihood generated by a separate tree for a different characteristic. Furthermore a likelihood measure is required that can be modified by the use of a priori probability. The decision-tree approach and the ratio measure satisfy these requirements. A specific example of this approach is given below.
A pilot study was carried out on a file of 4816 mass spectra from an older version of the NIH/EPA/MSDC MSSS master data base of 39,509 spectra . The purpose of the study was to classify the compounds into two categories: N. i.e. those containing nitrogen (actually 2162 spectra or 45% of the file), and NN, i.e. those not containing nitrogen (actually 2654 spectra or 55% of the file).
All the non-nitrogen compounds contained oxygen, and some of the nitrogen compounds contained oxygen. While this is not an absolutely unbiased data set, it is probably very close to one. The molecular weights of the compounds in the file ranged from 55 to 320. The latter weight was an arbitrary cut-off point and was used primarily to limit the size of the data base to under 5000 spectra in this illustrative pilot run. The m/z values were all integers between 1 and 320. Intensities of all m/z values were rounded off to the nearest integer (i.e. 0.4% = 0% and 15.7% = 16%) so that all intensities were between 0 and 100%.
First, the mass spectral data were separated, at random, into a learning set consisting of 75% of the spectra (1622 N. 1990 NN), and a test set consisting of the remaining 25% (540 N. 664 NN).
A tree-growing program (Classification and Regression Tree, CART, which is a software package owned by Technology Service Corporation), was applied to the learning set using three different sets of features. The first set (Case 1) was restricted to m/z values between 1 and 200, along with the corresponding intensities. The second set (Case 2) consisted of the m/z values from 1 to 200, and the losses (defined as the known molecular weight peak) for values from 0 (molecular ion) to 100 (molecular weight, 100), along with their corresponding intensities, were rounded off as stated above. Finally, the third set (Case 3) was formed by adding the molecular weight to the second set of features. In this last case, the molecular weight served only to tell whether the compound has an odd or even molecular weight. Each test set was run through each of the three trees constructed by CART. Table 1 shows the overall results when all spectra were classified. These results are not definitive, and should be interpreted only as a pilot test study. However, the high accuracy of classification achieved is very encouraging.
This pilot analysis provided some interesting results. It was unexpected that use of the m/z 1 - 100 values along with the 1--100 losses and the corresponding intensities instead of the 1--200 m/z values and their intensities gave no improvement in the overall classification rate.
However, when the use of the molecular weight was explored, it was found that including a single decision node, based on whether the molecular weight was even or odd, resulted in a dramatic increase in the classification rate, as in the number of decision nodes required. It should be emphasized that in the three cases tested the information content varied considerably. The first set required only m/z values and the second and third sets required the molecular weight to calculate losses; the third set also used the molecular weight to differentiate between even and odd values.
TABLE 1 - Percentage of correctly classified compounds (test set in parentheses)
APPROACHES TO TREE CONSTRUCTION
The methodology for deriving such trees is now discussed, guided by spectra of known classification. The basic objective of the decision-tree pattern recognition method is to classify the samples as accurately and efficiently as possible. Currently, there are no published approaches to finding the tree which truly minimizes the misclassification rate of known samples for a given size of tree. Payne and Meisel  have described a method for finding the smallest tree equivalent to a tree derived by any other method. Current work which extends this approach to direct minimization of the misclassification rate is nearly complete but computational limitations will restrict its applicability.
Current methodology uses a sub-optimal, one-node-at-a-time approach; thus, a tree is "grown" by adding branches. To begin, one must choose the first node. Two choices must be made: (1) the variable to be used in the node, and (2) the quantitative form of the decision made at that node. If there are N potential variables x(l), x(2), ..., x(n), and at each node decision rules are restricted to the form x(j) >= T, then j and T must be chosen at each node.
The most straightforward criterion for choosing j and T consists of minimizing the misclassification rate of known samples. Figure 2 shows how the node divides the samples of each class into those which satisfy the condition and those which do not. Assume, for clarity, that samples of Class 1 predominate in the left branch and that samples of Class 2 predominate in the right branch. Then the assignment of classification which minimizes the number of samples misclassified is to call samples in the left node Class 1 and to call those in the right node Class 2. This results in N(error) = n(2, L) + n(l, R) samples being misclassified.
For every choice of j and T. a different value of N(error) can easily be computed. One can then choose the values of j and T which give the minimum value of N(error)., i.e., the minimum number of misclassified samples.
Once the first node has been chosen, this approach can be repeated at any given node. For example, in Fig. 2, the same analysis can be applied to the n (1,R) samples in Class 1 and the n(2, R) of Class 2 in th e right-hand branch, yielding (probably) different values of j and T and adding another branch to the tree.
Fig. 2. Definition of notation for node-splitting.
A sequential procedure requires a stopping rule which indicates when the best values of j and T do not yield a statistically significant reduction in the misclassification rate. Most rules employed to date have been ad hoc, rather than based on theoretical significance computations. Typical approaches use a required minimum number of samples in each resulting branch and/or a maximum number of nodes in the tree.
The misclassification rate is the most easily understood criterion to use in choosing a decision rule at each node. Another possible criterion is an information measure; this was the measure used in deriving the nitrogen example above. Friedman  suggested the Kolmogorov--Smirnov distance. This distance was used in deriving the chlorine tree in Fig. 1. To date, no strong evidence of the superiority of any given measure has been found in this work.
Other methodology used in deriving trees has been outlined. Obviously, there are many variations on this theme. For example, the decision at each node need not be restricted to the use of only one variable; complex decision rules based on more than one variable can often be justified. When two variables are allowed, pairs of variables and the decision rule can often be chosen by looking at scatter plots as was done by Liles ; an interactive graphics package can be useful in this context.
Another extension of the methodology is to pick each node as optimal, looking ahead to the best rule at the resulting nodes. This two-steps-at-a-time approach involves much larger computation costs than the one-step-at-a-time approach.
One advantage of the iterative approach to tree construction described is that subjective considerations can be employed at each step. For example, at the top of the tree, one might limit consideration to variables which are seldom missing from a mass spectrum. Additionally, one might prefer a variable which came in a close second if it allowed a clearer physical interpretation than the variable which came in first.
1. See, e.g., J. Zupan, Anal. Chim. Acta, 103 (1978) 273.
2. R. S. Heller, G. W. A. Milne, R. J. Feldmann and S. R. Heller, J. Chem. Inf. Comp. Sci., 16 (1976) 176.
3. S. Sasaki, CHEMICS-F in Information Chemistry, University of Tokyo, 1975, p. 227.
4.P. C. Jurs and T. L. Isenhour, Chemical Applications of Pattern Recognition, Wiley, New York, 1975.
5. See, e.g., D. H. Smith (Ed.), Computer-Assisted Structure Elucidation, ACS Symp. Ser. 54, Washington, D.C., 1977.
6. Details of the latest MSSS data base available from: Dr. L. Gevantman, N.B.S., O.S.R.D., A323/221, Washington, D.C. 20234.
7.H. J. Payne and W. S. Meisel, IEEE Trans. Comput., C26 (1977) 905.
8.J. H. Friedman, IEEE Trans. Comput., C26 (1977) 404.
9.W. C. Liles, Interactive Decision Structuring System Manual, TSC publication, 1976.