The Mass Spectral Search System
Stephen R. Heller
U. S. Environmental Protection Agency,
MIDSD, PM-218,
Washington, D.C. 20460
1. Introduction
With the mass spectrum of a pure compound in hand, only one additional but major step is required to make a correct identification: the interpretation of the data. Interpretation may be made by applying the theory of mass spectra, and the rules of fragmentation of ions in the gas
phase. This process is tedious, and it is difficult to sustain the necessary deductive reasoning for the long periods of time required to make a large number of identifications. Insufficient knowledge of the details of the fragmentation process further limits the effectiveness of this approach in interpreting some spectra. Instead, the mass spectrometrist
may take advantage of the collections of reference mass spectra that have been accumulated in recent
years. Empirical methods have been developed for searching a file of reference mass spectra to find a
similar or exact match of an experimental mass spectrum.
Any empirical search and match system has two fundamental components:
(i) a data base that is nothing more than an organized collection of reference data, and
(ii) the system or approach that a user takes to search the data base.
Manual searching of printed data bases was explored first, and some elaborate indexing schemes were
developed to facilitate the user's search. Nevertheless, all manual search systems are rather slow,
subject to human error, and mentally fatiguing. It is also difficult and expensive to update a printed data
base since the index usually requires complete revision. The application of computers overcomes many
of these problems, but computerized search systems are constrained by the size and validity of the data
base, the thoroughness of the searching algorithms, and the cost of using the system.
2. History and Organization
In 1971 the EPA undertook the operational development of a computerized search system patterned
after an approach developed by Biemann and his associates at the Massachusetts Institute of
Technology. A significant feature of this system is that data are transmitted over conventional voice-grade telephone lines directly from a minicomputer to a program running in a large-scale remote time-sharing computer.
The remote computer has access to the data base, conducts a search for a match based on the
transmitted mass and abundance data, and sends the results back to the minicomputer in a matter of
seconds. A major advantage of this system is that the names of compounds whose spectra are similar to
the spectrum of the unknown are automatically printed at the user's terminal. Furthermore, they are
printed in order of the similarity of their spectra to the spectrum of the unknown. The degree of
similarity is measured by a numerical value on a scale of 0 to 1 that is included on the printout. Since the
whole operation is relatively automatic, probable identification can be made without full-time interaction
with a highly trained spectrometrist.
A typical search of this type is shown in Fig. 1: user responses are underlined for clarity. Mass and
abundance data for seven salient ions (m/e 15, 5%, etc.) are input, and 16 spectra are found
to be similar to that of the "unknown." The names of the best five are printed. Ethyl
isocyanide, with a similarity index of 0.999, most resembles the unknown.
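The ranking step of such a search can be sketched as follows. This is only an illustration: the text does not give the formula behind the similarity index, so a cosine measure is used here as a stand-in that also falls on a scale of 0 to 1. The function names, the toy library, and all abundance values are hypothetical.

```python
from math import sqrt

def similarity(unknown, reference):
    """Cosine similarity between two spectra given as {m/e: abundance} dicts.

    Assumption: this is a stand-in measure, not the index used by the MSSS.
    For non-negative abundances the result lies between 0 and 1.
    """
    masses = set(unknown) | set(reference)
    dot = sum(unknown.get(m, 0.0) * reference.get(m, 0.0) for m in masses)
    norm_u = sqrt(sum(a * a for a in unknown.values()))
    norm_r = sqrt(sum(a * a for a in reference.values()))
    if norm_u == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_u * norm_r)

def rank_matches(unknown, library, top_n=5):
    """Return the top_n (score, name) pairs, most similar first."""
    scored = [(similarity(unknown, spec), name) for name, spec in library.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Hypothetical reference library; the abundances are invented for illustration.
library = {
    "compound A": {15: 5, 27: 40, 55: 100},
    "compound B": {15: 5, 27: 42, 55: 100},
    "compound C": {43: 100, 58: 30},
}
unknown = {15: 5, 27: 42, 55: 100}

for score, name in rank_matches(unknown, library):
    print(f"{score:.3f}  {name}")
```

As in Fig. 1, the output lists the candidate names in order of decreasing similarity, each with its numerical score.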
The major part of the original data base used by the EPA was acquired from the Mass
Spectrometry Data Centre (MSDC; see Chapter 8). This data base was augmented by 600
EPA pollutant spectra.
FIG. 1. Typical Biemann-type search performed using the MSSS.
About the same time the Division of Computer Research and Technology (DCRT)
and National Heart, Lung and Blood Institute (NHLBI) of the National Institutes of
Health (NIH) implemented a matching system that, from the user's point of view, was
somewhat different. The data base selected was a slightly updated MSDC file, but the
data entry and the search procedures were different. The user enters mass and
abundance data, one pair at a time, from an inexpensive keyboard/printer terminal that
is interfaced to a conventional voice-grade telephone line. This terminal has no
obligatory connection to a GC-MS minicomputer, which allows a large group of users
of noncomputerized GC-MS systems to test and evaluate the spectrum matching
system. The user-entered data are transmitted to a large remote time-sharing computer
that has access to the data base. The search is conducted and the number of spectra
in the file having the mass/abundance pair is returned to the user in a matter of
seconds. By a repetition of this procedure the number of spectra that fit can be
minimized until the choice is among a small number of spectra. The user then requests
the names of these compounds to be printed at his terminal.
This type of search is illustrated in Fig. 2, again with user responses underlined. The
user finds first that 1008 spectra have the ion of m/e 85 in high relative abundance.
The search progresses by providing additional m/e values (128, 29, 69) for ions in the
spectrum of the "unknown," together with the corresponding relative abundances,
until only six of the 1008 spectra remain. All six are printed out, giving identification
numbers, molecular weights, molecular formulas, and names. No similarity index is
provided.
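The successive narrowing described above can be sketched as a chain of filters, each reporting how many spectra remain. This is a minimal sketch under stated assumptions: the abundance window attached to each m/e value, the function names, and the toy library are all hypothetical, and the text does not say how the NIH system defines an acceptable abundance match.

```python
def peak_search(candidates, mass, low, high):
    """Keep spectra whose peak at `mass` has relative abundance in [low, high]."""
    return {
        name: peaks
        for name, peaks in candidates.items()
        if low <= peaks.get(mass, 0) <= high
    }

def narrow(library, queries):
    """Apply (mass, low, high) queries in turn; return survivors and hit counts."""
    remaining = dict(library)
    counts = []
    for mass, low, high in queries:
        remaining = peak_search(remaining, mass, low, high)
        counts.append(len(remaining))
    return remaining, counts

# Hypothetical spectra as {m/e: relative abundance}; values are invented.
library = {
    "spectrum 1": {85: 100, 128: 20, 29: 45, 69: 30},
    "spectrum 2": {85: 95, 128: 22, 29: 10},
    "spectrum 3": {85: 90, 57: 100},
    "spectrum 4": {43: 100, 85: 12},
}

# Each query is (m/e, lowest acceptable abundance, highest acceptable abundance).
queries = [(85, 80, 100), (128, 10, 30), (29, 40, 60)]
remaining, counts = narrow(library, queries)
print(counts)             # number of spectra left after each query
print(sorted(remaining))  # names the user would ask to have printed
```

The user sees only the counts until the list is short enough to print, which mirrors the interaction shown in Fig. 2.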
An important feature of the NIH System is that the user can search the data base with
information other than mass and abundance data. Spectra can be retrieved based on
molecular weights, partial or complete molecular formulas, mass losses from the
molecular ion, classification codes, and combinations of all of these. Furthermore,
complete spectra can be typed out, displayed, or plotted (Fig. 3) at the user's terminal.
Since the user must impose his judgment in entering data, this system is oriented to
the experienced user. The flexibility of this system permits its use in situations where a
good match is not available in the data base. Spectra can be retrieved that have
features similar to the experimental spectrum, and these provide clues to the identity
of the unknown. The NIH system has found wide use in many government
laboratories, including the EPA. In addition, a number of private, industrial, and
university laboratories have used the system.
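A combined retrieval on molecular weight, partial molecular formula, and mass loss from the molecular ion might look like the following sketch. The record layout, the function names, and the rule that a partial formula fixes the count of only the elements named are assumptions made for illustration. The formulas and nominal molecular weights are those of the named compounds, but the peak abundances are invented.

```python
from dataclasses import dataclass

@dataclass
class Record:
    name: str
    mol_weight: int   # nominal molecular weight
    formula: dict     # element symbol -> atom count
    peaks: dict       # m/e -> relative abundance (illustrative values)

def matches(record, mol_weight=None, partial_formula=None, mass_loss=None):
    """True if the record satisfies every criterion that was supplied."""
    if mol_weight is not None and record.mol_weight != mol_weight:
        return False
    if partial_formula is not None:
        # Assumption: only the elements named in the query are constrained.
        for element, count in partial_formula.items():
            if record.formula.get(element, 0) != count:
                return False
    if mass_loss is not None:
        # The spectrum must contain a peak at (molecular ion - mass_loss).
        if (record.mol_weight - mass_loss) not in record.peaks:
            return False
    return True

def search(records, **criteria):
    return [r for r in records if matches(r, **criteria)]

records = [
    Record("toluene", 92, {"C": 7, "H": 8}, {91: 100, 92: 60}),
    Record("chlorobenzene", 112, {"C": 6, "H": 5, "Cl": 1}, {112: 100, 77: 50}),
    Record("benzyl chloride", 126, {"C": 7, "H": 7, "Cl": 1}, {91: 100, 126: 25}),
]

# Spectra of monochloro compounds that show a loss of 35 from the molecular ion.
for r in search(records, partial_formula={"Cl": 1}, mass_loss=35):
    print(r.mol_weight, r.name)
```

Because each criterion is optional, the same function covers a search on one property alone or on any combination of them.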
3. The Mass Spectral Search System
With the development and refinement of these systems it became apparent that consolidating them would be economical and beneficial. The EPA, in conjunction with the NHLBI, the Food and Drug Administration (FDA), and the MSDC, is supporting the consolidation of the systems into an international Mass Spectral Search System (MSSS), which is part of a larger NIH/EPA Chemical Information System (CIS). The goal of this merger is to provide a user-oriented, flexible, and self-supporting MSSS for the worldwide mass spectrometry community. The entire system is designed to encourage experimentation in the expectation that a better and more useful system will evolve. Significant advantages of the merged systems include worldwide access to the same data base and
continuous updating of the data base.
To attain the goal of a truly worldwide system it was decided to
implement the MSSS on a commercial time-sharing computer system
supported by a well-developed communications network. The Interactive
Services Corporation time-sharing system was selected; this network is
accessible by a local telephone call from many cities in the U.S., Canada, Mexico,
Europe, Israel, Japan, Hong Kong, and Australia. A small subscription fee ($300) is
paid by each user and this is used for maintenance, storage, and "update" costs for the
entire NIH/EPA CIS including the MSSS. The time-sharing system fees ($36 per
hour) are for computer and network connect-time only. A summary of current MSSS
options is given in Fig. 4.
FIG. 3. Typical line diagram produced by the MSSS.
FIG. 4. Current and planned options for use with the MSSS.
3.1. Current MSSS Use
By early 1978 there were more than 250 laboratories using the system, with an
average of two new accounts being added each week. Over a hundred searches per
day were performed; about a fifth of these from EPA laboratories. Both EPA and
non-EPA use of the MSSS continues to grow and use of the MSSS has been
incorporated into Agency policy by requiring that EPA contractors doing GC-MS
contract analysis work use the MSSS. In fact, recent contracts for GC-MS analysis
totaling over 2.0 million dollars will use the MSSS-Biemann option as the primary
standard for identification, thus assuring the validity and comparability of results with
those obtained by other contractors and in other laboratories. The heavy and growing use
of the MSSS is clearly contributing to the improvement of its data quality and quantity as
well as to its acceptance as a worldwide standard reference for mass spectrometry.
3.2. Future Development of the MSSS
Currently funded MSSS research and development projects include a vigorous effort
to expand and improve the quality of the combined data base of mass spectra. The
present data base consists of an expanded version of the original MSDC file, another
collection acquired from John Wiley and Sons, and spectra collected by the EPA, and
is available from the U.S. National Bureau of Standards (NBS) on magnetic tape and in
printed form. This amounts to over 31,000 unique spectra. The EPA and NIH have
contracted to acquire new spectra of particular interest to each agency. In addition,
contractors are evaluating the spectra in the present data base, removing erroneous
and duplicate spectra, and developing guidelines for the establishment of a large, high-quality file. All participating agencies are working to collect existing files of spectra
from spectrometrists throughout the world for inclusion in the data base. The EPA is
establishing collaborative efforts in data collection and software techniques with
environmental groups in Europe and environmental and agricultural groups in Canada.
Many contributions have been received from scientists around the world, making the
MSSS a truly user-oriented and user-accepted system. A goal of 40,000 high-quality
spectra of different compounds by 1980 has been set.
The original EPA-developed minicomputer-to-remote-computer direct transmission
system was compatible with only one GC-MS system, which employed a Digital
Equipment Corporation PDP-8 Processor. Work is in progress to develop
minicomputer and remote computer programs for many of the minicomputers that are
used on different GC-MS data systems. This effort is receiving some support from
GC-MS manufacturers and data system houses such as Finnigan, INCOS, Hewlett-Packard, VG-Data Systems, Systems Industries, and Varian-MAT; these companies
are participating in the development of direct transmission programs for their particular
minicomputers.
It is emphasized that the data base and the software for accessing and searching it are
separate and distinct. Therefore, a number of different and perhaps experimental
software search systems may be operational simultaneously with the same data base.
It is expected that in the near future new developments in software that use a mass
spectral data base will be available. Indeed, a user may wish to develop specialized
software and compare it to the existing operational software; this is encouraged by
the system designers. Future software includes structural interrogating systems (e.g.,
searches for all spectra of β-chloroamines), the Self-Training Interpretive and
Retrieval System (STIRS) developed by McLafferty and associates at Cornell
University (see p. 182), software based on learning machine or pattern recognition
techniques, software based on Wiswesser Line Notation (WLN) or Chemical Abstracts
Service (CAS) Registry Numbers (REGN), and structure connection table files.
Another item currently being implemented is to include the time and place of sampling
for pollutants as input along with unknown spectra. This would permit retrievals by
distribution of identified and even unidentified pollutants.
As virtually all GC-MS systems become computerized, another component of the MSSS is expected to evolve: the capability of a GC-MS data system minicomputer to extract from the remote computer, by a direct telephone connection, a subset of the large, continuously updated master data base. This mini-database of 500-2000 spectra would be retained at the local computer site, and minicomputer software would be used to search the small library locally. Specialized users who have large numbers of unknowns in one area of concern (pesticides, food additives, drugs) would have the benefit of decreased costs, yet would retain the advantages of a uniform data base
and the backup of the large master data base. This
approach is far more economical and feasible than an attempt to develop a complete
MSSS on a local spectrometer data system minicomputer. Support for a large data
base requires costly peripherals such as large disks for each minicomputer; flexible
search systems with many options require relatively large core memories for each
computer. It is difficult to develop time-sharing minicomputer software to permit
simultaneous data acquisition and data base searching. The maintenance of a large
data base on a small system is costly and time consuming; thus it tends to become
static.
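The subset idea can be sketched as follows: a capped selection is pulled from the master data base by area of concern and then searched locally. This is a hedged illustration only. The classification labels, the record layout, the function names, and the base-peak search used on the local copy are all hypothetical; only the 500-2000 spectra size range comes from the text.

```python
def extract_subset(master, wanted_class, max_spectra=2000):
    """Copy up to max_spectra spectra of one class out of the master data base."""
    subset = {}
    for name, entry in master.items():
        if entry["class"] == wanted_class:
            subset[name] = entry["peaks"]
            if len(subset) >= max_spectra:
                break
    return subset

def base_peak_search(subset, mass):
    """Local search: names of spectra whose most abundant peak is at `mass`."""
    return [
        name
        for name, peaks in subset.items()
        if peaks and max(peaks, key=peaks.get) == mass
    ]

# Hypothetical master data base; classes and abundances are invented.
master = {
    "pesticide 1": {"class": "pesticide", "peaks": {109: 100, 79: 40}},
    "drug 1": {"class": "drug", "peaks": {58: 100, 91: 20}},
    "pesticide 2": {"class": "pesticide", "peaks": {109: 60, 125: 100}},
    "pesticide 3": {"class": "pesticide", "peaks": {97: 100}},
}

# A small cap is used here only so the toy example shows the limit taking effect.
local = extract_subset(master, "pesticide", max_spectra=2)
print(sorted(local))
print(base_peak_search(local, 109))
```

Refreshing the local copy amounts to repeating the extraction against the updated master data base, so the small library need not become static.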
With the further development of the MSSS, the effectiveness of computerized GC-MS should improve substantially at a small additional cost. The cost per identification
should continue to decrease dramatically in the next few years.
4. Suggested Reading
MSSS User's Manual, available from the CIS project, Interactive Services Corp.,
Suite 500, 918 16th St. NW, Washington, D.C. 20006 (telephone number, 202-223-6503 or 800-424-9600).
S. R. HELLER, Anal. Chem. 44, 1951 (1972).
S. R. HELLER, G. W. A. MILNE, and R. J. FELDMANN, Science 195, 253 (1977).
The MSSS data on tape can be leased from the U.S. NBS. For details please contact
Dr. Lewis Gevantmann, NBS-OSRD, A323/221, Gaithersburg, Maryland 20234. The
MSSS in book form, EPA/NIH Mass Spectral Data Base by S. R. Heller and G. W.
A. Milne, is available from the U.S. Gov't. Printing Office as NBS publication
NSRDS NBS 63, stock number 003-003-01987-9 ($65, add 25% for other than U.S.
mailing).