Conversational Mass Spectral Search System

Display and Plotting of Spectra and Dissimilarity Comparison

Stephen R. Heller and Deena A. Koniver

Heuristics Laboratory, Division of Computer Research and Technology, National Institutes of Health, Bethesda, MD 20014

Henry M. Fales and G. W. Milne

Laboratory of Chemistry, National Heart and Lung Institute, National Institutes of Health, Public Health Service, Bethesda, MD 20014

An interactive, conversational mass spectral search system, available over ordinary telephone lines from a central PDP-10 time-sharing computer in the Division of Computer Research and Technology at the National Institutes of Health (NIH), has been used by over 200 researchers since September 1971. This system has now been made available internationally over the GE computer network to scientists on four continents. This report describes the dissimilarity index comparison option, the microfiche option in the system, and the use of a graphics terminal for the display and plotting of spectral data. The overall purpose of this work is to explore and present a variety of ideas and methods for mass spectral data library searching and presentation of results.

The system uses the basic technique suggested by Biemann (1) of an "abbreviated spectrum" file consisting of the two most intense peaks in every 14 amu interval, beginning at m/e = 6. In addition, the molecular weight and molecular formula of each compound are available for searching. Figure 1 shows the current options of the system. For details of those not described here, the reader is referred elsewhere (2-4).
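The abbreviation step can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the program used in the system: the function name `abbreviate`, the window boundaries (6-19, 20-33, and so on), the tie-breaking rule, and the example intensities are all assumptions.

```python
from collections import defaultdict

def abbreviate(spectrum, start=6, width=14, keep=2):
    """Keep the `keep` most intense peaks in each `width`-amu window.

    `spectrum` maps integer m/e to relative intensity (percent).
    Windows are assumed to be [start, start + width), and so on upward.
    """
    windows = defaultdict(list)
    for mz, intensity in spectrum.items():
        if mz < start or intensity <= 0:
            continue  # below the first window, or not a real peak
        windows[(mz - start) // width].append((mz, intensity))

    abbreviated = {}
    for peaks in windows.values():
        # Highest intensity first; ties broken by lower m/e (an assumption).
        peaks.sort(key=lambda p: (-p[1], p[0]))
        for mz, intensity in peaks[:keep]:
            abbreviated[mz] = intensity
    return dict(sorted(abbreviated.items()))

# Made-up intensities, for illustration only.
example = {41: 30, 42: 20, 43: 55, 57: 100, 149: 80, 150: 9, 151: 3, 167: 30}
print(abbreviate(example))
# {41: 30, 43: 55, 57: 100, 149: 80, 150: 9, 167: 30}
```

In this example m/e 42 and 151 are dropped because each is the third most intense peak in its 14 amu window.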

Experience with the Present System. The evolution and use of this system have helped to confirm the observations (5) that compounds can be identified from their mass spectra by using only a few peaks. Generally it has been found that a choice of 2-4 peaks from the abbreviated spectrum results in a short list (usually fewer than 10) of possible (or probable) answers.

The system is used for an average of 25 sessions per day by over 200 scientists in the United States and Canada. Each session usually consists of 3-6 separate searches, and well over 90% of these are just the peak and intensity searches. It has also been found that the intensities of peaks are, in general, of minimal value in narrowing down possible answers, as Grotch and Isenhour have shown in their studies on compression of mass spectral data (6-8).

DCRT/CIS

MASS SPECTRAL SEARCH SYSTEM

1. PEAK AND INTENSITY SEARCH

2. MOLECULAR WEIGHT SEARCH

3. MOLECULAR FORMULA SEARCH

a. Complete

b. Imbedded

4. MOLECULAR WEIGHT AND PEAK SEARCH

5. MOLECULAR FORMULA AND PEAK SEARCH

6. MOLECULAR WEIGHT AND MOLECULAR FORMULA SEARCH

7. DISSIMILARITY COMPARISON

8. SPECTRUM PRINTOUT

9. MICROFICHE DISPLAY OF SPECTRUM

10. DISPLAY SPECTRUM

11. PLOTTING OF SPECTRUM

12. CRAB - COMMENTS and COMPLAINTS

13. HARVEST - ENTERING NEW DATA

14. NEWS - NEWS OF THE SYSTEM

15. MSDC CLASSIFICATION CODE LIST

Figure 1. Options of mass spectral search system

PROGRAM: MASS SPEC PEAK AND INTENSITY SEARCH

USER: INTENSITY RANGE FACTOR FOR THIS SEARCH IS: 5

TYPE PEAK, INT

CR TO EXIT, 1 FOR ID#/NAMES, 3 FOR ID, MW, MF, NAMES

USER: 149,100

PROGRAM: FOUND 853 SPECTRA WITH M/E PEAK: 149

#REFS M/E PEAKS

153 149

TYPE PEAK, INT

CR TO EXIT, 1 FOR ID #/NAMES, 3 FOR ID, MW, MF, NAMES

USER: 167,30

PROGRAM: FOUND 725 SPECTRA WITH M/E PEAK: 167

#REFS M/E PEAKS

15 149 167

TYPE PEAK, INT

CR TO EXIT, 1 FOR ID #/NAMES, 3 FOR ID, MW, MF, NAMES

USER: 113,10

PROGRAM: FOUND 936 SPECTRA WITH M/E PEAK: 113

#REFS M/E PEAKS

3 149 167 113

TYPE PEAK, INT

CR TO EXIT, 1 FOR ID #/NAMES, 3 FOR ID, MW, MF, NAMES

USER: 3

ID# MW MF NAME

219 390 C24.H38.O4 DIOCTYL PHTHALATE

8219 390 C24.H38.O4 DIOCTYL PHTHALATE

8466 390 C24.H38.O4 ISO-OCTYL PHTHALATE

Figure 2. Typical search for dioctyl phthalate using intensity range factor of 5

PROGRAM: MASS SPEC PEAK AND INTENSITY SEARCH

USER: INTENSITY RANGE FACTOR FOR THIS SEARCH IS : 100

TYPE PEAK, INT

CR TO EXIT, 1 FOR ID#/NAMES

USER: 149,50

PROGRAM: FOUND 853 SPECTRA WITH M/E PEAK: 149

#REFS M/E PEAKS

853 149

TYPE PEAK, INT

CR TO EXIT, 1 FOR ID#/NAMES

USER: 167,30

PROGRAM: FOUND 725 SPECTRA WITH M/E PEAK: 167

#REFS M/E PEAKS

106 149 167

TYPE PEAK, INT

CR TO EXIT, 1 FOR ID#/NAMES

USER: 113,10

PROGRAM: FOUND 936 SPECTRA WITH M/E PEAK: 113

#REFS M/E PEAKS

8 149 167 113

TYPE PEAK, INT

CR TO EXIT, 1 FOR ID#/NAMES

USER: 3

ID# MW MF NAME

219 390 C24.H38.O4 DIOCTYL PHTHALATE

277 0 XX MIMOSINE

4572 238 C12.H8.O.CL2 4-CHLOROPHENYL ETHER

7082 167 C12.H9.N CARBAZOLE

8071 296 C21.H44 2,6,10,14-TETRAMETHYLHEPTADECANE

8218 399 C22.H29.N3.S2 TORECANE

8219 390 C24.H38.O4 DIOCTYL PHTHALATE

8466 390 C24.H38.O4 ISO-OCTYL PHTHALATE

Figure 3. Typical search for dioctyl phthalate using intensity range factor of 100

Figures 2 and 3 show a search for Octoil (dioctyl phthalate) with peaks at 113, 149, and 167, using range factors of 5 (i.e., any intensity whose value lies between 1/5 and 5 times the entered intensity is an acceptable answer) and 100. The underlined information is the only information entered by the user. A range factor of 100 effectively eliminates the intensity from consideration, since the window obtained by dividing and multiplying any entered intensity (from 1-100%) by 100 covers essentially every intensity that can occur in the file. The differences between the two results are clearly negligible, and a fourth peak entered in the search (not shown in Figure 3) reduces the suggested answers from the search to Octoil and 2,6,10,14-tetramethylheptadecane. Thus, while compression of intensity data into one or two bits has been shown to be acceptable (6-8), the complete elimination of intensity data can now be considered (for example, where storage is a concern), and this system gives the user the flexibility to choose whatever range, if any, he wishes.
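The range factor test described above amounts to a single comparison. The following Python sketch shows it; the function name `intensity_in_range` and the treatment of the endpoints as inclusive are assumptions.

```python
def intensity_in_range(entered, reference, factor):
    """True if `reference` lies between entered/factor and entered*factor."""
    if factor < 1:
        raise ValueError("range factor must be >= 1")
    return entered / factor <= reference <= entered * factor

# Entered intensity 100 with factor 5 accepts file intensities from 20 upward.
print(intensity_in_range(100, 25, 5))    # True
print(intensity_in_range(100, 10, 5))    # False
# With factor 100 even a 1% file peak matches an entered intensity of 50.
print(intensity_in_range(50, 1, 100))    # True
```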

The search routine usually gives responses that are identical or close to the correct answer. The system, as previously described (2, 3), requires the user to type in only a few peaks, and the list of possible answers comes from a comparison of only those few peaks that are entered. The only information entered into the computer for the searches is that underlined in Figures 2 and 3, and this fact should be carefully noted. The search system does not search the entire library and does not use the entire spectrum, but only the complete "abbreviated" spectrum. Thus, the system is not a sequential search and, therefore, its time and cost do not increase linearly with the size of the file.
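One common way to avoid a sequential pass over the library is an inverted file keyed by m/e, with each entered peak narrowing the candidate set. The sketch below assumes that organization; it is not taken from the system's code, and the names `build_index` and `search` and the four-entry toy library are invented for illustration.

```python
from collections import defaultdict

def build_index(library):
    """Map each m/e value to {spectrum_id: intensity} (an inverted file)."""
    index = defaultdict(dict)
    for spec_id, peaks in library.items():
        for mz, intensity in peaks.items():
            index[mz][spec_id] = intensity
    return index

def search(index, queries, factor):
    """Narrow the candidate set one (m/e, intensity) entry at a time."""
    candidates = None
    for mz, entered in queries:
        # Only spectra listed under this m/e are examined.
        hits = {
            spec_id
            for spec_id, ref in index.get(mz, {}).items()
            if entered / factor <= ref <= entered * factor
        }
        candidates = hits if candidates is None else candidates & hits
        print(f"{len(candidates)} refs after m/e {mz}")
    return sorted(candidates or [])

# Toy abbreviated spectra with made-up intensities.
toy_library = {
    "A": {113: 9, 149: 100, 167: 28},
    "B": {113: 12, 149: 100, 167: 35},
    "C": {149: 2, 167: 100},
    "D": {57: 100, 113: 10},
}
index = build_index(toy_library)
print(search(index, [(149, 100), (167, 30), (113, 10)], factor=5))  # ['A', 'B']
print(search(index, [(149, 100), (167, 30)], factor=100))           # ['A', 'B', 'C']
```

Each query touches only the entries filed under the entered m/e value, so the work depends on how many spectra contain that peak rather than on the total number of spectra.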

Over 10,000 searches have been performed with the system, and its continued use suggests it is valuable in leading to the correct solutions to the wide variety of problems which the user list would seem to imply. This list of regular users includes, in addition to groups at the National Institutes of Health, laboratories of the Environmental Protection Agency; the Departments of Agriculture, Interior, and Justice; the Bureau of Mines; the Departments of the Treasury, Army, Navy, and Air Force; and NASA, as well as many NIH grantees and contractors in medical schools, hospitals, and universities. Several industrial concerns, especially pharmaceutical houses, are also represented.

Development of a Dissimilarity Index. In spite of the general success of the system, the use of so few peaks will, in some cases, lead to false structural suggestions. Other mass spectral search systems (4-13) have developed "similarity indexes," "mismatch values," etc., which give a quantitative relation of the unknown to the possible answers from the system. While we feel strongly that only by direct comparison of the actual spectrum and the unknown can one determine their true "similarity," an automatic measure of which spectra are similar and which are dissimilar could prove of some value. Many users expressed such a need through the use of the "CRAB" option (which allows users to type comments directly into a computer disk file) as well as in personal communications.

The result of this feedback is our development of a "dissimilarity index" (DI), which differs from previous work in its emphasis on excessive weighting when a peak is present in the unknown and not in the reference (and vice versa). It was derived from the Euclidean distance formula, and the final form was reached by analyzing its utility in distinguishing mass spectra.

It is produced by comparing the intensities of the complete unknown spectrum with the intensities of a file spectrum. For this comparison, a file of the complete, not just the abbreviated, spectra is used. That is, the abbreviated file is used only for searching. The complete file (over 820,000 peaks and intensities for the 8782 spectra) is used here for comparison, as well as for the spectrum printout option and the display and plotting options described later in this paper. Because it was felt that differences should be heavily emphasized, the index was calculated by Equation 1 (all of these decisions, it should be emphasized, are quite arbitrary).

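$$\mathrm{DI} = \frac{1}{400} \sum_{m=12}^{400} \left| I_u(m)^2 - I_r(m)^2 \right|^{K/2} \qquad (1)$$

where $I_u(m)$ and $I_r(m)$ are the intensities of the unknown and reference spectra at mass $m$, and K = 2 if and only if $I_u$ or $I_r$ = 0; otherwise, K = 1.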

The index uses the differences in the squares of the intensities at every mass from 12 to 400 and further emphasizes differences when there is a peak in one spectrum but none in the other. (This extra emphasis was introduced to accentuate differences or dissimilarities heavily, and it does distort results in some cases.) In the latter case, the difference in the squares of the intensities is used, rather than the square root of the difference in the squares. (For an unknown intensity $I_u$ and a reference intensity $I_r = 0$, the contribution is $I_u^2$ rather than $\sqrt{I_u^2} = I_u$.) This has the effect of providing, in general, reasonable separation for low molecular weight compounds and much greater separation for high molecular weight compounds. The method leads to great dissimilarity of non-homologous compounds because many peaks will be absent in one spectrum that are present in the other. It is felt that the search itself provides a sufficient degree of information regarding homologous compounds so that this index can be used to reflect other information. It has also been found that the empirically derived DI effectively separates isomers, terpenes, and halogen-containing compounds from mass spectra of similar compounds, and it is most valuable in this laboratory in checking supposed "duplicate" spectra in the file. Using the hypothetical spectrum in Figure 4, the INDEX is calculated using Equation 2.

Figure 4. Hypothetical spectrum used for calculating dissimilarity index (panels: UNKNOWN, REFERENCE)

TABLE I

DISSIMILARITY INDEX COMPARISONS

Unknown: 3-Butyn-1-ol (File Spectrum #2820)

MW = 70, MF = C4.H6.O

ID#	Compound	DI
1321	3-Butyn-1-ol	0.12
1164	Butadiene monoxide	0.38
2826	3,4-Epoxy-1-butene	0.39
90	2,5-Dihydrofuran	0.52
1295	2-Methylpropenal	1.69
1320	3-Butyn-2-ol	2.92
92	Dimethylketene	3.65
1	Methylvinyl ketone	24.77
7390	Divinyl ether	25.00

The factor of 400 is used to scale the INDEX numbers down to a range of 0 to about 50 and originated from the mass range being used. It is an arbitrary normalization constant. The upper value of DI is approximate due to the arbitrary nature of the scaling factor. In addition to the 12-400 mass range, the user has the option of changing the range to any range from 1-500 he wishes to examine. This ability to change ranges and compare regions of interest to the particular user is in keeping with the overall philosophy of flexibility of the system.
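The calculation can be sketched in Python as follows. This is an interpretation of the prose description, not the authors' program: at each mass it adds the square root of the absolute difference of the squared intensities, or the full difference of squares when the peak is missing from either spectrum, and divides the sum by 400. The function name `dissimilarity_index`, the use of percent intensities, and the example spectra are assumptions.

```python
import math

def dissimilarity_index(unknown, reference, low=12, high=400, scale=400.0):
    """Dissimilarity of two spectra given as {m/e: percent intensity} dicts."""
    total = 0.0
    for mz in range(low, high + 1):
        i_u = unknown.get(mz, 0.0)
        i_r = reference.get(mz, 0.0)
        diff = abs(i_u ** 2 - i_r ** 2)
        if i_u == 0 or i_r == 0:
            total += diff             # peak absent on one side: K = 2
        else:
            total += math.sqrt(diff)  # peak present in both: K = 1
    return total / scale

# Made-up spectra, for illustration only.
unknown = {43: 100, 57: 50}
reference = {43: 100, 57: 40, 71: 10}
print(dissimilarity_index(unknown, reference))        # 0.325
# Restricting the mass range, as the user option allows, changes the result.
print(dissimilarity_index(unknown, reference, 12, 60))  # 0.075
```

In this sketch the m/e 57 pair contributes sqrt(2500 - 1600) = 30 and the unmatched m/e 71 peak contributes 10 squared = 100, so a weak unmatched peak outweighs a moderate intensity disagreement. Under the same reading, a 100% peak with no counterpart would contribute 10000/400 = 25 on its own.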

In the first example, a molecular formula (MF) search for C4.H6.O produced 10 possible isomers. Using the 3-butyn-1-ol spectrum from the file (ID# 2820) as the unknown, the DI's of the other nine isomers are shown in Table I. The least dissimilar spectrum is another spectrum of 3-butyn-1-ol. Note also that the next two least dissimilar neighbors have DI's of 0.38 and 0.39, which suggests that they are compounds that are very similar to each other. Indeed, they are the same compound, differently named, as so often occurs in the file. Thus in both cases the DI has indicated the presence of duplicate spectra in the file, and a decision to drop one or the other might be suggested. However, relative to 3-butyn-1-ol, ID# 1164 might have a DI equal to that of ID# 2826 for entirely different reasons, and it would be dangerous to remove one or the other spectrum on this basis alone, without comparing them directly with each other.

TABLE II

DISSIMILARITY INDEX COMPARISONS

Unknown: Diphenyl Ether (File Spectrum #5351)

MW = 170, MF = C12.H10.O

ID#	Compound	DI
3918	Phenyl Ether	0.20
5350	1-Hydroxy-4-phenylbenzene	0.26
3915	o-Phenylphenol	0.29
3929	p-Phenylphenol	0.31
3921	m-Phenylphenol	0.31
3919	m-Bromotoluene	3.76
6580	1,3-Dihydro-4,6-dimethylthieno[3,4-c]thiophene	17.00

Figure 5. Complete spectrum of heroin

Figure 6. Partial spectrum of heroin

Using diphenyl ether (ID# 5351), a molecular weight and peak search was performed with the base peak of m/e = 170, leading to six possible answers. The resulting compounds and their DI's are shown in Table II, where again a duplicate spectrum has been discovered. A second pair, differently named again, has been uncovered (ID# 5350 and ID# 3920), but in this case the two spectra are not as similar to each other as they are to an isomer, ID# 3915. On the other hand, inspection of the actual spectra of o- and p-phenylphenol run under identical instrumental conditions allows an easy distinction to be made, pointing up the earlier caveat in regard to the use of this device.

Visual Display of Mass Spectral Data. To assist visual comparison of the spectra, a number of options were devised. The first was to display on a cathode ray tube (CRT) a mass spectrum from the file as a bar plot. The routine uses the same disk file as the SPEC option previously described and was readily programmed using the PDP-10 OMNIGRAPH routines (13), a software package that drives a variety of displays using a general set of routines. The displays currently used are the DEC-340, DEC GT-40, Tektronix-4000 series, ARDS, Adage AGT-40, and Computek 400. Examples of the display options using a spectrum of heroin are shown in Figures 5 and 6. These figures were not photographed from the display screen, but were produced by the plot option, DPLOT, contained within the OMNIGRAPH routine, which generates a file processed by PLOTS to produce a Calcomp tape, which drives the Calcomp plotter. Additionally, the PLOTX routine can drive laboratory plotters such as the Zeta plotter. Figures 5 and 6 are the plots of the complete and partial spectrum of heroin. The program allows users the option of displaying or plotting any 100 amu interval they desire, so that portions of the spectrum can be readily viewed.
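The idea of viewing one 100 amu interval as a bar plot can be illustrated with a short, text-only Python sketch. It does not use or imitate the OMNIGRAPH routines; the function names `window` and `bar_plot`, the scaling to 50 columns, and the toy spectrum are all assumptions.

```python
def window(spectrum, start, width=100):
    """Peaks with start <= m/e < start + width, sorted by m/e."""
    return sorted(
        (mz, intensity)
        for mz, intensity in spectrum.items()
        if start <= mz < start + width
    )

def bar_plot(peaks, columns=50):
    """One text row per peak, scaled so that 100% fills `columns` characters."""
    lines = []
    for mz, intensity in peaks:
        bar = "#" * round(intensity * columns / 100)
        lines.append(f"{mz:4d} |{bar}")
    return "\n".join(lines)

# Made-up spectrum, for illustration only.
toy_spectrum = {50: 10, 120: 40, 210: 100, 250: 20, 305: 60}
print(bar_plot(window(toy_spectrum, 200)))
```

Calling `window(toy_spectrum, 200)` selects only the peaks at m/e 210 and 250, which corresponds to asking for the 200-299 portion of a spectrum.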

As an alternative to these elegant but expensive methods, a computer-generated microfiche file of spectra was developed to eliminate the necessity of the vast storage (3/4 million computer words at present) needed for the SPEC, DISPLAY, and PLOT options while providing convenient and inexpensive physical copies of the file. The entire 8772 spectra of the file have been produced on forty-seven 42X-reduction microfiche, which are very inexpensive to duplicate in the aggregate. Manual microfiche readers are readily available for as little as about $100. Alternatively, computer-driven microfiche viewers are also available. One such viewer, installed in the Laboratory of Chemistry of the National Heart and Lung Institute, is capable of holding over 150,000 mass spectra on a carousel device that allows immediate access to, and viewing of, any spectrum in a maximum of four seconds. It will also produce an 11-inch x 14-inch copy of a spectrum. A computer terminal with a telephone acoustic coupler is used, and the microfiche image is shown on the screen. A typical microfiche contains 192 spectra. A binary-coded metal clip at the top of the microfiche is used by the computer-controlled viewer to locate the appropriate fiche for display.

These new options of display, plotting, and microfiche, as well as the original spectrum printout, provide for a variety of methods for visually representing mass spectra and are offered to stimulate discussion of a variety of alternatives to meet various needs and available facilities.

There are a number of other modifications of the system that have been suggested during its use by mass spectroscopists. They include changing the number of peaks chosen in the 14-amu interval, and using the molecular weight as a filtering factor in peak searching, similar to the way the intensity factor is now used. Also, on-line updating of the file as users add new data is possible. The addition of the MSDC classification codes and structural information, as well as a reverse spectrum search (searching for neutral losses), are also being considered, and it is hoped this work will provide the basis for future experimentation by others.

Received for review May 10, 1973. Accepted December 27, 1973. This paper is part III in a series on an Interactive Conversational Mass Spectral Search System.

References

*Present address, Management Information and Data Systems Division, Environmental Protection Agency, Washington, D.C. 20460

(1) H. S. Hertz, R. A. Hites, and K. Biemann, Anal. Chem., 43, 681 (1971).

(2) S. R. Heller, Anal. Chem., 44, 1951 (1972).

(3) S. R. Heller, H. M. Fales, and G. W. A. Milne, Org. Mass Spectrom., 7, 107 (1973).

(4) S. R. Heller, R. J. Feldmann, H. M. Fales, and G. W. A. Milne, J. Chem. Doc., 13, 130 (1973).

(5) R. G. Ridley, Chapter 6, "Compound Identification by Computer Matching Mass Spectrometry," in "Biochemical Applications of Mass Spectrometry," G. R. Waller, Ed., John Wiley, New York, NY, 1971, and references therein.

(6) S. L. Grotch, Anal. Chem., 43, 1362 (1971).

(7) S. L. Grotch, Anal. Chem., 45, 2 (1973).

(8) L. E. Wangen, W. S. Woodward, and T. L. Isenhour, Anal. Chem., 43, 1605 (1971).

(9) V. L. Talrose, V. V. Raznikov, and G. D. Tantsyrev, Dokl. Akad. Nauk SSSR, 159, 182 (1964).

(10) B. A. Knock, I. C. Smith, D. E. Wright, and R. G. Ridley, Anal. Chem., 42, 1516 (1970).

(11) S. Abrahamsson, S. Stallberg-Stenhagen, and E. Stenhagen, Biochem. J., 92, 2P (1964).

(12) B. Pettersson and R. Ryhage, Ark. Kemi, 26, 2123 (1967).

(13) DCRT/CCB, "PDP-10 Display System Manual," November 1972.