The Mass Spectral Search System

Stephen R. Heller

U. S. Environmental Protection Agency,

MIDSD, PM-218,

Washington, D.C. 20460

1. Introduction

With the mass spectrum of a pure compound in hand, only one additional but major step is required to make a correct identification: the interpretation of the data. Interpretation may be made by applying the theory of mass spectra and the rules of fragmentation of ions in the gas phase. This process is tedious, and it is difficult to sustain the necessary deductive reasoning for the long periods of time required to make a large number of identifications. Insufficient knowledge of the details of the fragmentation process further limits the effectiveness of this approach in interpreting some spectra. Instead, the mass spectrometrist may take advantage of the collections of reference mass spectra that have been accumulated in recent years. Empirical methods have been developed for searching a file of reference mass spectra to find a similar or exact match of an experimental mass spectrum.

Any empirical search and match system has two fundamental components:

(i) a data base that is nothing more than an organized collection of reference data, and

(ii) the system or approach that a user takes to search the data base.

Manual searching of printed data bases was explored first, and some elaborate indexing schemes were developed to facilitate the user's search. Nevertheless, all manual search systems are rather slow, subject to human error, and mentally fatiguing. It is also difficult and expensive to update a printed data base since the index usually requires complete revision. The application of computers overcomes many of these problems, but computerized search systems are constrained by the size and validity of the data base, the thoroughness of the searching algorithms, and the cost of using the system.

2. History and Organization

In 1971 the EPA undertook the operational development of a computerized search system patterned after an approach developed by Biemann and his associates at the Massachusetts Institute of Technology. A significant feature of this system is that data are transmitted over conventional voice-grade telephone lines directly from a minicomputer to a program running in a large-scale remote time-sharing computer.

The remote computer has access to the data base, conducts a search for a match based on the transmitted mass and abundance data, and sends the results back to the minicomputer in a matter of seconds. A major advantage of this system is that the names of compounds whose spectra are similar to the spectrum of the unknown are automatically printed at the user's terminal. Furthermore, they are printed in order of the similarity of their spectra to the spectrum of the unknown. The degree of similarity is measured by a numerical value on a scale of 0 to 1 that is included on the printout. Since the whole operation is relatively automatic, probable identification can be made without full-time interaction with a highly trained spectrometrist.

A typical search of this type is shown in Fig. 1; user responses are underlined for clarity. Mass and abundance data for seven salient ions (m/e 15, 5%, etc.) are input, and 16 spectra are found to be similar to that of the "unknown." The names of the best five are printed. Ethyl isocyanide, with a similarity index of 0.999, most resembles the unknown.
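The ranking step of such a search can be sketched as follows. This is only an illustration: the text does not give the formula behind the similarity index, so cosine similarity is swapped in here as a stand-in that also falls on a 0-to-1 scale. The reference file, the compound names, and the function names `similarity` and `ranked_search` are all hypothetical.

```python
import math

# Hypothetical miniature reference file: name -> {m/e: relative abundance, %}.
# The abundances are illustrative values, not measured data.
REFERENCE = {
    "compound A": {15: 5, 27: 40, 28: 60, 55: 100},
    "compound B": {15: 10, 29: 100, 43: 80, 58: 30},
    "compound C": {27: 35, 28: 55, 54: 20, 55: 100},
}


def similarity(unknown, reference):
    """Cosine similarity of two spectra (an assumed stand-in for the
    similarity index, whose actual definition is not given in the text)."""
    masses = set(unknown) | set(reference)
    dot = sum(unknown.get(m, 0) * reference.get(m, 0) for m in masses)
    norm_u = math.sqrt(sum(a * a for a in unknown.values()))
    norm_r = math.sqrt(sum(a * a for a in reference.values()))
    if norm_u == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_u * norm_r)


def ranked_search(unknown, library, best=5):
    """Return up to `best` (score, name) pairs, most similar first."""
    scored = [(similarity(unknown, spectrum), name)
              for name, spectrum in library.items()]
    scored.sort(reverse=True)
    return scored[:best]


unknown = {15: 5, 27: 40, 28: 60, 55: 100}
for score, name in ranked_search(unknown, REFERENCE):
    print(f"{score:.3f}  {name}")
```

As in Fig. 1, the output lists the closest reference spectra in decreasing order of similarity, so the first line is the most probable identification.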

The major part of the original data base used by the EPA was acquired from the Mass Spectrometry Data Centre (MSDC; see Chapter 8). This data base was augmented by 600 EPA pollutant spectra.

FIG. 1. Typical Biemann-type search performed using the MSSS.

About the same time the Division of Computer Research and Technology (DCRT) and National Heart, Lung and Blood Institute (NHLBI) of the National Institutes of Health (NIH) implemented a matching system that, from the user's point of view, was somewhat different. The data base selected was a slightly updated MSDC file, but the data entry and the search procedures were different. The user enters mass and abundance data, one pair at a time, from an inexpensive keyboard/printer terminal that is interfaced to a conventional voice-grade telephone line. This terminal has no obligatory connection to a GC-MS minicomputer, which allows a large group of users of noncomputerized GC-MS systems to test and evaluate the spectrum matching system. The user-entered data are transmitted to a large remote time-sharing computer that has access to the data base. The search is conducted and the number of spectra in the file having the mass/abundance pair is returned to the user in a matter of seconds. By a repetition of this procedure the number of spectra that fit can be minimized until the choice is among a small number of spectra. The user then requests the names of these compounds to be printed at his terminal.

This type of search is illustrated in Fig. 2, again with user responses underlined. The user finds first that 1008 spectra have the ion of m/e 85 in high relative abundance. The search progresses by providing additional m/e values (128, 29, 69) for ions in the spectrum of the "unknown," together with the corresponding relative abundances, until only six of the 1008 spectra remain. All six are printed out, giving identification numbers, molecular weights, molecular formulas, and names. No similarity index is provided.
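The successive narrowing described here can be sketched as a chain of filters. This is a minimal illustration under stated assumptions: each entered peak is treated as an m/e value with an acceptable abundance window, which is a guess at how a mass/abundance pair might be matched. The library contents and the names `narrow` and `peak_search` are hypothetical.

```python
# Hypothetical reference file: identification number -> {m/e: abundance, %}.
# All values are illustrative, not measured data.
LIBRARY = {
    101: {29: 40, 69: 25, 85: 100, 128: 15},
    102: {29: 35, 57: 60, 85: 90},
    103: {43: 100, 85: 20, 128: 10},
    104: {29: 45, 69: 30, 85: 95, 128: 12},
}


def narrow(candidates, library, mass, low, high):
    """Keep candidates whose abundance at `mass` lies within [low, high].
    A missing peak counts as abundance 0 (assumed convention)."""
    return [i for i in candidates if low <= library[i].get(mass, 0) <= high]


def peak_search(library, peaks):
    """Apply the peaks one at a time, recording the hit count after each."""
    candidates = sorted(library)
    counts = []
    for mass, low, high in peaks:
        candidates = narrow(candidates, library, mass, low, high)
        counts.append(len(candidates))
    return candidates, counts


# Peaks entered one pair at a time: (m/e, lowest abundance, highest abundance).
entered = [(85, 80, 100), (128, 5, 25), (29, 30, 50), (69, 20, 40)]
remaining, counts = peak_search(LIBRARY, entered)
print(counts)     # number of spectra still fitting after each peak
print(remaining)  # identification numbers left to print out
```

Reporting the count after each peak mirrors the interactive dialogue: the user keeps adding peaks until the count is small enough to request the names.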

An important feature of the NIH System is that the user can search the data base with information other than mass and abundance data. Spectra can be retrieved based on molecular weights, partial or complete molecular formulas, mass losses from the molecular ion, classification codes, and combinations of all of these. Furthermore, complete spectra can be typed out, displayed, or plotted (Fig. 3) at the user's terminal. Since the user must impose his judgment in entering data, this system is oriented to the experienced user. The flexibility of this system permits its use in situations where a good match is not available in the data base. Spectra can be retrieved that have features similar to the experimental spectrum, and these provide clues to the identity of the unknown. The NIH system has found wide use in many government laboratories, including the EPA. In addition, a number of private, industrial, and university laboratories have used the system.
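Retrieval on criteria other than peaks, and on combinations of criteria, might look like the following sketch. It assumes that the molecular ion appears at the nominal molecular weight, so that a mass loss of n corresponds to a fragment peak at (molecular weight - n). The records and the function names are hypothetical.

```python
# Hypothetical records; molecular weights and peaks are illustrative only.
RECORDS = [
    {"id": 1, "mol_weight": 128, "peaks": {85: 100, 113: 20, 128: 15}},
    {"id": 2, "mol_weight": 128, "peaks": {57: 100, 99: 30, 128: 5}},
    {"id": 3, "mol_weight": 100, "peaks": {57: 100, 85: 40, 100: 10}},
]


def by_molecular_weight(records, weight):
    """Records with the given nominal molecular weight."""
    return [r for r in records if r["mol_weight"] == weight]


def by_mass_loss(records, loss, min_abundance=1):
    """Records showing a fragment at (molecular weight - loss).
    Assumes the molecular ion sits at the nominal molecular weight."""
    return [r for r in records
            if r["peaks"].get(r["mol_weight"] - loss, 0) >= min_abundance]


def combine(*result_sets):
    """Identification numbers common to every result set."""
    if not result_sets:
        return []
    id_sets = [{r["id"] for r in results} for results in result_sets]
    return sorted(set.intersection(*id_sets))


# Spectra of molecular weight 128 that also show a loss of 15 mass units.
print(combine(by_molecular_weight(RECORDS, 128), by_mass_loss(RECORDS, 15)))
```

Keeping each criterion as a separate function that returns a result set makes combinations a simple intersection, which is one way to support the flexible, judgment-driven searching described above.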

3. The Mass Spectral Search System

With the development and refinement of these systems it became apparent that a consolidation of the two systems would be economical and beneficial. The EPA, in conjunction with the NHLBI, the Food and Drug Administration (FDA), and the MSDC, is supporting the consolidation of the systems into an international Mass Spectral Search System (MSSS), which is part of a larger NIH/EPA Chemical Information System (CIS). The goal of this merger is to provide a user-oriented, flexible, and self-supporting MSSS for the worldwide mass spectrometry community. The entire system is designed to encourage experimentation in the expectation that a better and more useful system will evolve. Significant advantages of the merged systems include worldwide access to the same data base and continuous updating of the data base.

To attain the goal of a truly worldwide system it was decided to implement the MSSS on a commercial time-sharing computer system supported by a well-developed communications network. The Interactive Services Corporation time-sharing system was selected; this network is accessible by a local telephone call from many cities in the U.S., Canada, Mexico, Europe, Israel, Japan, Hong Kong, and Australia. A small subscription fee ($300) is paid by each user and this is used for maintenance, storage, and "update" costs for the entire NIH/EPA CIS including the MSSS. The time-sharing system fees ($36 per hour) are for computer and network connect-time only. A summary of current MSSS options is given in Fig. 4.

FIG. 3. Typical line diagram produced by the MSSS.

FIG. 4. Current and planned options for use with the MSSS.

3.1. Current MSSS Use

By early 1978 there were more than 250 laboratories using the system, with an average of two new accounts being added each week. Over a hundred searches per day were performed, about a fifth of them from EPA laboratories. Both EPA and non-EPA use of the MSSS continues to grow, and use of the MSSS has been incorporated into Agency policy: EPA contractors doing GC-MS contract analysis work are required to use the MSSS. In fact, recent contracts for GC-MS analysis totaling over 2.0 million dollars will use the MSSS-Biemann option as the primary standard for identification, thus assuring the validity of the results and their comparability with those obtained by other contractors and in other laboratories. The heavy and growing use of the MSSS is clearly contributing to the improvement of its data quality and quantity as well as to its acceptance as a worldwide standard reference for mass spectrometry.

3.2. Future Development of the MSSS

Currently funded MSSS research and development projects include a vigorous effort to expand and improve the quality of the combined data base of mass spectra. The present data base, which consists of an expanded version of the original MSDC file, another collection acquired from John Wiley and Sons, and spectra collected by the EPA, amounts to over 31,000 unique spectra and is available from the U.S. National Bureau of Standards (NBS) on magnetic tape and in printed form. The EPA and NIH have contracted to acquire new spectra of particular interest to each agency. In addition, contractors are evaluating the spectra in the present data base, removing erroneous and duplicate spectra, and developing guidelines for the establishment of a large, high-quality file. All participating agencies are working to collect existing files of spectra from spectrometrists throughout the world for inclusion in the data base. The EPA is establishing collaborative efforts in data collection and software techniques with environmental groups in Europe and environmental and agricultural groups in Canada. Many contributions have been received from scientists around the world, making the MSSS a truly user-oriented and user-accepted system. A goal of 40,000 high-quality spectra of different compounds by 1980 has been set.

The original EPA-developed minicomputer-to-remote-computer direct transmission system was compatible with only one GC-MS system, which employed a Digital Equipment Corporation PDP-8 Processor. Work is in progress to develop minicomputer and remote computer programs for many of the minicomputers that are used on different GC-MS data systems. This effort is receiving some support from GC-MS manufacturers and data system houses such as Finnigan, INCOS, Hewlett-Packard, VG-Data Systems, Systems Industries, and Varian-MAT; these companies are participating in the development of direct transmission programs for their particular minicomputers.

It is emphasized that the data base and the software for accessing and searching it are separate and distinct. Therefore, a number of different and perhaps experimental software search systems may be operational simultaneously with the same data base. It is expected that in the near future new developments in software that use a mass spectral data base will be available. Indeed, a user may wish to develop specialized software and compare it to the existing operational software; this is encouraged by the system designers. Future software includes structural interrogating systems (e.g., searches for all spectra of β-chloroamines), the Self-Training Interpretive and Retrieval System (STIRS) developed by McLafferty and associates at Cornell University (see p. 182), software based on learning machine or pattern recognition techniques, software based on Wiswesser Line Notation (WLN) or Chemical Abstracts Service (CAS) Registry Numbers (REGN), and structure connection table files. Another item currently being implemented is the inclusion of the time and place of sampling for pollutants as input along with unknown spectra. This would permit retrievals by distribution of identified and even unidentified pollutants.

As virtually all GC-MS systems become computerized, another component of the MSSS is expected to evolve. This would be the capability of a GC-MS data system minicomputer to extract from the remote computer, by a direct telephone connection, a subset of the large, continuously updated master data base. This mini-database of 500-2000 spectra would be retained at the local computer site, and minicomputer software would be used to search the small library locally. Specialized users who have large numbers of unknowns in one area of concern (pesticides, food additives, or drugs, for example) would have the benefit of decreased costs, yet would retain the advantages of a uniform data base and the backup of the large master data base. This approach is far more economical and feasible than an attempt to develop a complete MSSS on a local spectrometer data system minicomputer. Support for a large data base requires costly peripherals such as large disks for each minicomputer; flexible search systems with many options require relatively large core memories for each computer. It is difficult to develop time-sharing minicomputer software to permit simultaneous data acquisition and data base searching. The maintenance of a large data base on a small system is costly and time consuming; thus it tends to become static.
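The idea of pulling a specialized subset from the master file and searching it locally can be sketched as below. The `class_code` field, the record contents, and the function names are hypothetical; the limit of 2000 simply reflects the upper end of the subset size mentioned above.

```python
# Hypothetical master file; class codes and peaks are illustrative only.
MASTER = [
    {"id": 1, "class_code": "pesticide", "peaks": {109: 100, 185: 30}},
    {"id": 2, "class_code": "drug", "peaks": {58: 100, 91: 20}},
    {"id": 3, "class_code": "pesticide", "peaks": {97: 100, 109: 45}},
    {"id": 4, "class_code": "food additive", "peaks": {43: 100, 109: 5}},
]


def extract_subset(master, wanted_classes, limit=2000):
    """Copy the records of the wanted classes, capped at `limit` spectra."""
    subset = [r for r in master if r["class_code"] in wanted_classes]
    return subset[:limit]


def local_search(subset, mass, min_abundance):
    """Search only the local subset for a peak of sufficient abundance."""
    return [r["id"] for r in subset
            if r["peaks"].get(mass, 0) >= min_abundance]


local = extract_subset(MASTER, {"pesticide"})
print(len(local), local_search(local, 109, 40))
```

Because the subset is re-extracted from the master file whenever needed, the local library stays small while still tracking the updated master data base.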

With the further development of the MSSS, the effectiveness of computerized GC-MS should improve substantially at a small additional cost. The cost per identification should continue to decrease dramatically in the next few years.

4. Suggested Reading

MSSS User's Manual, available from the CIS project, Interactive Services Corp., Suite 500, 918 16th St. NW, Washington, D.C. 20006 (telephone number, 202-223-6503 or 800-424-9600).

S. R. HELLER, Anal. Chem. 44, 1951 (1972).

S. R. HELLER, G. W. A. MILNE, and R. J. FELDMANN, Science 195, 253 (1977).

The MSSS data on tape can be leased from the U.S. NBS. For details please contact Dr. Lewis Gevantmann, NBS-OSRD, A323/221, Gaithersburg, Maryland 20234. The MSSS in book form, EPA/NIH Mass Spectral Data Base by S. R. Heller and G. W. A. Milne, is available from the U.S. Gov't. Printing Office as NBS publication NSRDS NBS 63, stock number 003-003-01987-9 ($65, add 25% for other than U.S. mailing).