Conversational Mass Spectral Search System
Display and Plotting of Spectra and Dissimilarity Comparison
Stephen R. Heller and Deena A. Koniver
Heuristics Laboratory, Division of Computer Research and Technology, National Institutes of
Health, Bethesda, MD 20014
Henry M. Fales and G. W. Milne
Laboratory of Chemistry, National Heart and Lung Institute, National Institutes of Health, Public
Health Service, Bethesda, MD 20014
An interactive, conversational mass spectral retrieval search system, available over ordinary
telephone lines from a central PDP-10 time sharing computer in the Division of Computer
Research and Technology at the National Institutes of Health (NIH), has been used by over 200
researchers since September 1971. This system has now been made available internationally
over the GE computer network to scientists on four continents. This report describes the
dissimilarity index comparison option, the microfiche option in the system, and the use of a
graphics terminal for the display and plotting of spectral data. The overall purpose of this work
is to explore and present a variety of ideas and methods for mass spectral data library searching
and presentation of results.
The system uses the basic technique suggested by Biemann (1) of an "abbreviated spectrum"
file consisting of the two most intense peaks in every 14 amu interval, beginning at m/e = 6. In
addition, the molecular weight and molecular formula of each compound are available for
searching. Figure 1 shows the current options of the system. For details of those not described,
the reader is referred elsewhere (2-4).
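As a rough illustration of the abbreviated-spectrum idea, the following Python sketch keeps the two most intense peaks in each 14 amu window starting at m/e = 6. It assumes windows of 6-19, 20-33, and so on, and a spectrum held as a dictionary from integer m/e to percent intensity. The function name `abbreviate`, the tie-breaking rule, and the sample peaks are illustrative assumptions, not details of the system itself.

```python
from collections import defaultdict


def abbreviate(spectrum, start=6, width=14, keep=2):
    """Keep the `keep` most intense peaks in each `width`-amu window.

    `spectrum` maps integer m/e to relative intensity (percent of base peak).
    Window boundaries (start, start+width, ...) are an assumption.
    """
    windows = defaultdict(list)
    for mz, intensity in spectrum.items():
        if mz < start or intensity <= 0:
            continue  # peaks below the first window, or empty, are dropped
        windows[(mz - start) // width].append((mz, intensity))

    abbreviated = {}
    for peaks in windows.values():
        # Most intense first; ties go to the lower m/e (arbitrary choice).
        peaks.sort(key=lambda p: (-p[1], p[0]))
        for mz, intensity in peaks[:keep]:
            abbreviated[mz] = intensity
    return dict(sorted(abbreviated.items()))


# Hypothetical peaks, for illustration only.
example = {6: 1, 10: 5, 15: 20, 19: 3, 20: 50, 33: 7, 34: 100}
short = abbreviate(example)
```

In this example the first window (m/e 6-19) holds four peaks, of which only the two strongest survive.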
Experience with the Present System. The evolution and use of this system has helped to
confirm the observations (5) that compounds can be identified from their mass spectra by using
only a few peaks. Generally it has been found that a choice of 2-4 peaks from the abbreviated
spectrum results in a list of a small number (usually less than 10) of possible (or probable)
answers.
The system is used for an average of 25 sessions per day by over 200 scientists in the United
States and Canada. Each session usually consists of 3-6 separate searches, and well over 90% of
these are simply peak and intensity searches. It has also been found that the intensities of peaks
are, in general, of minimal value in narrowing down possible answers, as Grotch and Isenhour
have shown in their studies on compression of mass spectral data (6-8).
1. PEAK AND INTENSITY SEARCH
2. MOLECULAR WEIGHT SEARCH
3. MOLECULAR FORMULA SEARCH
a. Complete
b. Imbedded
4. MOLECULAR WEIGHT AND PEAK SEARCH
5. MOLECULAR FORMULA AND PEAK SEARCH
6. MOLECULAR WEIGHT AND MOLECULAR FORMULA SEARCH
7. DISSIMILARITY COMPARISON
8. SPECTRUM PRINTOUT
9. MICROFICHE DISPLAY OF SPECTRUM
10. DISPLAY SPECTRUM
11. PLOTTING OF SPECTRUM
12. CRAB - COMMENTS and COMPLAINTS
13. HARVEST - ENTERING NEW DATA
14. NEWS - NEWS OF THE SYSTEM
15. MSDC CLASSIFICATION CODE LIST
Figure 1. Options of mass spectral search system
PROGRAM: MASS SPEC PEAK AND INTENSITY SEARCH
USER: INTENSITY RANGE FACTOR FOR THIS SEARCH IS: 5
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID#/NAMES, 3 FOR ID, MW, MF, NAMES
USER: 149,100
PROGRAM: FOUND 853 SPECTRA WITH M/E PEAK: 149
#REFS M/E PEAKS
153 149
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID #/NAMES, 3 FOR ID, MW, MF, NAMES
USER: 167,30
PROGRAM: FOUND 725 SPECTRA WITH M/E PEAK: 167
#REFS M/E PEAKS
15 149 167
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID #/NAMES, 3 FOR ID, MW, MF, NAMES
USER: 113,10
PROGRAM: FOUND 936 SPECTRA WITH M/E PEAK: 113
#REFS M/E PEAKS
3 149 167 113
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID #/NAMES, 3 FOR ID, MW, MF, NAMES
USER: 3
ID# MW MF NAME
219 390 C24.H38.O4 DIOCTYL PHTHALATE
8219 390 C24.H38.O4 DIOCTYL PHTHALATE
8466 390 C24.H38.O4 ISO-OCTYL PHTHALATE
Figure 2. Typical search for dioctyl phthalate using intensity range factor of 5
PROGRAM: MASS SPEC PEAK AND INTENSITY SEARCH
USER: INTENSITY RANGE FACTOR FOR THIS SEARCH IS : 100
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID#/NAMES
USER: 149,50
PROGRAM: FOUND 853 SPECTRA WITH M/E PEAK: 149
#REFS M/E PEAKS
853 149
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID#/NAMES
USER: 167,30
PROGRAM: FOUND 725 SPECTRA WITH M/E PEAK: 167
#REFS M/E PEAKS
106 149 167
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID#/NAMES
USER: 113,10
PROGRAM: FOUND 936 SPECTRA WITH M/E PEAK: 113
#REFS M/E PEAKS
8 149 167 113
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID#/NAMES
USER: 3
ID# MW MF NAME
219 390 C24.H38.O4 DIOCTYL PHTHALATE
277 0 XX MIMOSINE
4572 238 C12.H8.O.CL2 4-CHLOROPHENYL ETHER
7082 167 C12.H9.N CARBAZOLE
8071 296 C21.H44 2,6,10,14-TETRAMETHYLHEPTADECANE
8218 399 C22.H29.N3.S2 TORECANE
8219 390 C24.H38.O4 DIOCTYL PHTHALATE
8466 390 C24.H38.O4 ISO-OCTYL PHTHALATE
Figure 3. Typical search for dioctyl phthalate using intensity range factor of 100
Figures 2 and 3 show a search for Octoil (dioctyl phthalate) with peaks at 113, 149, and 167,
using range factors of 5 (i.e., any intensity whose value lies between 1/5 and 5 times the
entered intensity is an acceptable answer) and 100. The underlined information is the only
information entered by the user. A range factor of 100 effectively eliminates the intensity from
consideration, since any intensity (from 1-100%) divided by or multiplied by 100 will lie between
0.01 and 100%. The differences between the two results are clearly negligible, and a fourth peak
entered in the search (not shown in Figure 3) reduces the suggested answers from the search to
Octoil and 2,6,10,14-tetramethylheptadecane. Thus, while compression of intensity data into
one or two bits has been shown to be acceptable (6-8), the complete elimination of intensity data
can now be considered (for example, where storage is a concern), and this
system gives the user the flexibility to choose whatever range, if any, he wishes.
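The range-factor test described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (a stored intensity is accepted if it lies between the entered intensity divided by the factor and the entered intensity multiplied by the factor); the function name `intensity_matches` is hypothetical, and whether the actual program treats the bounds as inclusive is an assumption.

```python
def intensity_matches(entered, stored, range_factor):
    """Return True if `stored` lies within the window set by the range factor.

    The window runs from entered / range_factor to entered * range_factor.
    Inclusive bounds are an assumption.
    """
    low = entered / range_factor
    high = entered * range_factor
    return low <= stored <= high


# With a factor of 5, an entered intensity of 100 accepts 20 through 500.
narrow_ok = intensity_matches(100, 30, 5)
narrow_reject = intensity_matches(100, 10, 5)

# With a factor of 100, an entered intensity of 50 accepts anything from 0.5 up,
# so every stored intensity from 1 to 100% passes.
wide_ok = all(intensity_matches(50, s, 100) for s in range(1, 101))
```

With a factor of 100 the test no longer discriminates among stored peaks of 1-100%, which is why that setting amounts to ignoring intensity.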
The search routine usually gives responses that are identical or close to the correct answer.
The system, as previously described (2, 3), requires the user to type in only a few peaks, and the
list of possible answers comes from a comparison of only those few peaks.
It should be carefully noted that the only information entered into the computer for the searches
is that underlined in Figures 2 and 3. The search system does not search the entire
library and does not use the entire spectrum, but only the complete "abbreviated" spectrum.
Thus, the search is not sequential, and its time and cost therefore do not increase linearly
with the size of the library.
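One common way to avoid a sequential scan is an inverted file keyed by m/e, so that each entered peak is looked up directly and the candidate lists are intersected. The text here does not describe the system's actual file organization, so the Python sketch below is an assumption offered only to make the idea concrete; the names `build_peak_index` and `search` and the three-spectrum library are hypothetical.

```python
from collections import defaultdict


def build_peak_index(library):
    """Map each m/e value to {spectrum_id: intensity} over abbreviated spectra."""
    index = defaultdict(dict)
    for spectrum_id, peaks in library.items():
        for mz, intensity in peaks.items():
            index[mz][spectrum_id] = intensity
    return index


def search(index, query, range_factor):
    """Return IDs whose spectra contain every queried peak within range.

    `query` is a list of (m/e, entered intensity) pairs, applied in order,
    so the candidate list can only shrink as peaks are added.
    """
    hits = None
    for mz, entered in query:
        matching = {
            sid
            for sid, stored in index.get(mz, {}).items()
            if entered / range_factor <= stored <= entered * range_factor
        }
        hits = matching if hits is None else hits & matching
    return sorted(hits or [])


# Hypothetical abbreviated spectra, for illustration only.
library = {
    1: {113: 10, 149: 100, 167: 30},
    2: {149: 100, 167: 2},
    3: {57: 100, 149: 15},
}
index = build_peak_index(library)
strict = search(index, [(149, 100), (167, 30)], range_factor=5)
loose = search(index, [(149, 100), (167, 30)], range_factor=100)
```

Each lookup touches only the spectra that contain the entered peak, so the work depends on the length of those lists rather than on the size of the whole library.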
Over 10,000 searches have been performed with the system, and its continued use suggests that it
is valuable in leading to correct solutions to the wide variety of problems that the user list
would seem to imply. This list of regular users includes, in addition to groups at the National
Institutes of Health, the laboratories of the Environmental Protection Agency; the Departments of
Agriculture, Interior, and Justice; the Bureau of Mines; the Departments of the Treasury, Army,
Navy, and Air Force; and NASA, as well as many NIH grantees and contractors in medical schools,
hospitals, and universities. Several industrial concerns, especially pharmaceutical houses, are also
represented.
Development of a Dissimilarity Index. In spite of the general success of the system, the use
of so few peaks will, in some cases, lead to false structural suggestions. Other mass spectral
search systems (4-13) have developed "similarity indexes," "mismatch values," etc. which give a
quantitative relation of the unknown to the possible answers from the system. While we feel
strongly that only by direct comparison of the actual spectrum and the unknown can one
determine their true "similarity," an automatic measure of which spectra are similar and which
are dissimilar could prove of some value. Many users expressed such a need through the use of
the "CRAB" option (which allows users to type comments directly into a computer disk file) as
well as in personal communications.
The result of this feedback is our development of a "dissimilarity index" (DI), which differs
from previous work in the extra weight it gives to a peak that is present in the
unknown but not in the reference (and vice versa). It was derived from the Euclidean
distance formula, and the final form was reached by analyzing its utility in distinguishing mass
spectra.
The DI is produced by comparing the intensities of the complete unknown spectrum with the
intensities of a file spectrum. For this comparison, a file of the complete, not just the
abbreviated, spectra is used. That is, the abbreviated file is used only for searching. The
complete file (over 820,000 peaks and intensities for the 8782 spectra) is used here for
comparison, as well as for the spectrum printout option and the display and plotting
options described later in this paper. Because it was felt that differences should be heavily
emphasized, the index is calculated by Equation 1 (all of these decisions, it should be
emphasized, are quite arbitrary).
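$$\mathrm{DI} = \frac{1}{400} \sum_{m=12}^{400} \left| I_{u}^{2}(m) - I_{r}^{2}(m) \right|^{K/2} \qquad (1)$$
where $I_u$ and $I_r$ are the intensities of the unknown and reference spectra at mass $m$, and K = 2 if and only if $I_u$ or $I_r$ = 0; otherwise, K = 1.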
The index uses the differences in the squares of the intensities at every mass from 12 to 400
and further emphasizes differences when there is a peak in one spectrum but none in the other.
(This extra emphasis was introduced to accentuate dissimilarities heavily, and it does
distort results in some cases.) In the latter case, the difference in the squares of the intensities is
used, rather than the square root of the difference in the squares. (The term is $D_1^2 - D_2^2$, but
since $D_2 = 0$, the contribution is $D_1^2$ rather than $\sqrt{D_1^2} = D_1$.) This has the effect of
providing, in general, reasonable separation for low molecular weight compounds and much greater
separation for high molecular weight compounds. The method leads to great dissimilarity of
non-homologous compounds because many peaks will be absent in one spectrum that are present in the
other. It is felt that the search itself provides a sufficient degree of information regarding
homologous compounds so that this index can be used to reflect other information. It has also been
found that the empirically derived DI effectively separates the mass spectra of isomers, terpenes,
and halogen-containing compounds from those of similar compounds, and it is most valuable in this
laboratory in checking supposed "duplicate" spectra in the file. Using the hypothetical spectrum in
Figure 4, the INDEX is calculated using Equation 2.
Figure 4. Hypothetical spectrum used for calculating dissimilarity index (panels: UNKNOWN and REFERENCE)
Table I. Dissimilarity indexes (DI) of C4.H6.O isomers relative to 3-butyn-1-ol (ID# 2820)
ID# | Compound | DI |
1321 | 3-Butyn-1-ol | 0.12 |
1164 | Butadiene monoxide | 0.38 |
2826 | 3,4-Epoxy-1-butene | 0.39 |
90 | 2,5-Dihydrofuran | 0.52 |
1295 | 2-Methylpropenal | 1.69 |
1320 | 3-Butyn-2-ol | 2.92 |
92 | Dimethylketene | 3.65 |
1 | Methylvinyl ketone | 24.77 |
7390 | Divinyl ether | 25.00 |
The factor of 400 is used to scale the INDEX numbers down to a range of 0 to about 50 and
originated from the mass range being used. It is an arbitrary normalization constant. The upper
value of DI is approximate because of the arbitrary nature of the scaling factor. In addition to the
default 12-400 mass range, the user has the option of selecting any range within 1-500 that he
wishes to examine. This ability to change ranges and compare regions of interest to the particular
user is in keeping with the overall philosophy of flexibility of the system.
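The calculation described in this section can be sketched in Python as follows. The sketch assumes the index is a sum over masses of the absolute difference of the squared intensities, taken as is when a peak is absent from one spectrum (K = 2) and as its square root otherwise (K = 1), divided by 400. The function name `dissimilarity_index`, the dictionary representation, and the sample peaks are illustrative assumptions.

```python
def dissimilarity_index(unknown, reference, low=12, high=400, scale=400.0):
    """Sketch of the DI between two spectra over the mass range low..high.

    Spectra map integer m/e to percent intensity; missing masses count as 0.
    """
    total = 0.0
    for mz in range(low, high + 1):
        iu = unknown.get(mz, 0.0)
        ir = reference.get(mz, 0.0)
        diff = abs(iu * iu - ir * ir)
        if iu == 0 or ir == 0:
            total += diff         # K = 2: peak absent from one spectrum
        else:
            total += diff ** 0.5  # K = 1: square root of the difference
    return total / scale


# Hypothetical spectra, for illustration only.
unknown = {50: 100, 60: 50}
reference = {50: 100, 60: 30, 70: 20}
di = dissimilarity_index(unknown, reference)

# A 100% peak with no counterpart contributes 10000 / 400 = 25 on its own.
missing_base_peak = dissimilarity_index({43: 100}, {})
```

Here the shared peak at m/e 60 contributes sqrt(2500 - 900) = 40, while the unmatched 20% peak at m/e 70 contributes 400, which shows how strongly an absent peak is weighted relative to an intensity mismatch. Passing different `low` and `high` values corresponds to the user's option of changing the mass range.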
In the first example, a molecular formula (MF) search for C4.H6.O produced 10 possible
isomers. Using the 3-butyn-1-ol spectrum from the file (ID# 2820) as the unknown, the DI's of
the other nine isomers are shown in Table I. The least dissimilar spectrum is another
spectrum of 3-butyn-1-ol. Note also that the next two least dissimilar neighbors have DI's of
0.38 and 0.39, which suggests that they are compounds that are very similar to each other.
Indeed, they are the same compound, differently named, as so often occurs in the file. Thus, in
both cases the DI has indicated the presence of duplicate spectra in the file, and a decision to drop
one or the other might be suggested. However, relative to 3-butyn-1-ol, ID# 1164 might have a
DI equal to that of ID# 2826 for entirely different reasons, and it would be dangerous to remove
one or the other spectrum on this basis alone, without comparing them directly with each other.
Table II. Dissimilarity indexes (DI) of the search answers relative to diphenyl ether (ID# 5351)
ID# | Compound | DI |
3918 | Phenyl Ether | 0.20 |
5350 | 1-Hydroxy-4-phenylbenzene | 0.26 |
3915 | o-Phenylphenol | 0.29 |
3929 | p-Phenylphenol | 0.31 |
3921 | m-Phenylphenol | 0.31 |
3919 | m-Bromotoluene | 3.76 |
6580 | 1,3-Dihydro-4,6-dimethylthieno[3,4-c]thiophene | 17.00 |
Figure 5. Complete spectrum of heroin
Figure 6. Partial spectrum of heroin
Using diphenyl ether (ID# 5351) as the unknown, a molecular weight and peak search was performed
with the base peak of m/e = 170, leading to six possible answers. The resulting compounds and their
DI's are shown in Table II, where again a duplicate spectrum has been discovered. A second pair,
differently named again, has been uncovered (ID# 5350 and ID# 3920), but in this case the two
spectra are not as similar to each other as they are to an isomer, ID# 3915. On the other hand,
inspection of the actual spectra of o- and p-phenylphenol run under identical instrumental
conditions allows an easy distinction to be made, pointing up the earlier caveat in regard to the
use of this device.
Visual Display of Mass Spectral Data. To assist visual comparison of the spectra, a
number of options were devised. The first was to display a mass spectrum from the file as a bar
plot on a cathode ray tube (CRT). The routine uses the same disk file as the SPEC option
previously described and was readily programmed using the PDP-10 OMNIGRAPH routines
(13), a software package that drives a variety of displays using a general set of routines. The
displays currently used are the DEC-340, DECGT-40, Tektronix-4000 series, ARDS, Adage
AGT-40, and Computek 400. Examples of the display options using a spectrum of heroin are
shown in Figures 5 and 6. These figures were not photographed from the display screen, but
were produced by the plot option, DPLOT, contained within the OMNIGRAPH routine, which
generates a file processed by PLOTS to produce a Calcomp tape, which drives the Calcomp
plotter. Additionally, the PLOTX routine can drive laboratory plotters such as the Zeta plotter.
Figures 5 and 6 are plots of the complete and partial spectrum of heroin. The program allows
users the option of displaying or plotting any 100 amu interval they desire, so that portions
of the spectrum can be readily viewed.
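The idea of viewing one 100 amu interval of a spectrum as a bar plot can be illustrated with a small text-only Python sketch. It has nothing to do with the OMNIGRAPH routines themselves; the function name `bar_plot`, the character-based bars, and the sample peaks are all assumptions made for illustration.

```python
def bar_plot(spectrum, start, width=100, max_bar=50):
    """Render the peaks in [start, start + width) as horizontal text bars.

    `spectrum` maps integer m/e to percent intensity. A 100% peak is drawn
    with `max_bar` characters; every non-zero peak gets at least one.
    """
    rows = []
    for mz in range(start, start + width):
        intensity = spectrum.get(mz, 0)
        if intensity > 0:
            bar = "#" * max(1, round(intensity * max_bar / 100))
            rows.append(f"{mz:4d} {intensity:5.1f} {bar}")
    return "\n".join(rows)


# Hypothetical peaks, for illustration only.
peaks = {43: 40.0, 327: 100.0, 369: 10.0}
window = bar_plot(peaks, start=300)
print(window)
```

Only the peaks at m/e 327 and 369 fall inside the 300-399 window, so the peak at m/e 43 is not drawn.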
As an alternative to these elegant but expensive methods, a computer generated microfiche
file of spectra was developed to eliminate the need for the vast storage (3/4 million computer
words at present) required by the SPEC, DISPLAY, and PLOT options while providing
convenient and inexpensive physical copies of the file. The entire 8772 spectra of the file have
been produced on forty-seven 42X-reduction microfiche, which are very inexpensive to duplicate
in the aggregate. Manual microfiche readers are readily available for as little as about $100.
Alternatively, computer driven microfiche viewers are also available. One such viewer, installed
in the Laboratory of Chemistry of the National Heart and Lung Institute, is capable of holding
over 150,000 mass spectra on a carousel device that allows immediate access to, and viewing of,
any spectrum in a maximum of four seconds. It will also produce an 11-inch x 14-inch copy of
a spectrum. A computer terminal with a telephone acoustic coupler is used, and the microfiche
image is shown on the screen. A typical microfiche contains 192 spectra. A binary coded metal
clip at the top of the microfiche is used by the computer controlled viewer to locate the
appropriate fiche for display.
These new options of display, plotting, and microfiche, as well as the original spectrum
printout, provide for a variety of methods for visually representing mass spectra and are offered
to stimulate discussion of a variety of alternatives to meet various needs and available facilities.
A number of other modifications of the system have been suggested during its
use by mass spectroscopists. They include changing the number of peaks chosen in the 14-amu
interval and using the molecular weight as a filtering factor in peak searching, similar to the way
the intensity factor is now used. On-line updating of the file as users add new data is also
possible. The addition of the MSDC classification codes and structural information, as well as a
reverse spectrum search (searching for neutral losses), is also being considered. It is hoped
that this work will provide the basis for future experimentation by others.
Received for review May 10, 1973. Accepted December 27, 1973. This paper is part III in a
series on an Interactive Conversational Mass Spectral Search System.
*Present address, Management Information and Data Systems Division, Environmental
Protection Agency, Washington, D.C. 20460
References
(1) H. S. Hertz, R. A. Hites, and K. Biemann, Anal. Chem., 43, 681 (1971).
(2) S. R. Heller, Anal. Chem., 44, 1951 (1972).
(3) S. R. Heller, H. M. Fales, and G. W. A. Milne, Org. Mass Spectrom., 7, 107 (1973).
(4) S. R. Heller, R. J. Feldmann, H. M. Fales, and G. W. A. Milne, J. Chem. Doc., 13, 130
(1973).
(5) R. G. Ridley, Chapter 6, "Compound Identification by Computer Matching Mass
Spectrometry," in "Biochemical Applications of Mass Spectrometry," G. R. Waller, Ed.,
John Wiley, New York, NY, 1971, and references therein.
(6) S. L. Grotch, Anal. Chem., 43, 1362 (1971).
(7) S. L. Grotch, Anal. Chem., 45, 2 (1973).
(8) L. E. Wangen, W. S. Woodward, and T. L. Isenhour, Anal. Chem., 43, 1605 (1971).
(9) V. L. Talrose, V. V. Raznikov, and G. D. Tantsyrev, Dokl. Akad. Nauk SSSR, 159, 182
(1964).
(10) B. A. Knock, I. C. Smith, D. E. Wright, and R. G. Ridley, Anal. Chem., 42, 1516 (1970).
(11) S. Abrahamsson, S. Stallberg-Stenhagen, and E. Stenhagen, Biochem. J., 92, 2P (1964).
(12) B. Pettersson and R. Ryhage, Ark. Kemi, 26, 2123 (1967).
(13) DCRT/CCB, "PDP-10 Display System Manual," November 1972.