Mass Spectrometry Databases and Search Systems
Stephen R. Heller
Agricultural Research Service
US Department of Agriculture
Beltsville, MD 20705 USA
This chapter will provide the reader with a discussion
of mass spectrometry databases and some examples of
library search systems used in mass spectrometry.
The development of mass spectral databases started in
the 1940's with the American Petroleum Institute (API)
Project 44 activities. The reason that mass spectrometry
database activity goes back so far is, no doubt, due to
the nature of mass spectral data. The mass spectrum of a
chemical produces data which are ideally suited for
representation and manipulation in digital form.
Compared to infrared (IR) and nuclear magnetic resonance
(NMR) spectral data, mass spectra are the ultimate in
simplicity: just peaks and intensities. However, this is
not to say the data are easy to understand, interpret, or
correlate.
While the API Project 44 continued on over the years,
other groups began to initiate their own mass spectral
data collections. It was not until 1965 that the British
Government began funding its Atomic Weapons
Research Establishment (AWRE) at Aldermaston for the
purpose of creating a world-wide database of mass
spectra. This project funded a group which became known
as the Mass Spectrometry Data Centre (MSDC) at
Aldermaston. A few years later, the US National
Institutes of Health (NIH) Laboratory of Chemistry, which
was heavily involved in mass spectrometry, began the
development of a computer based library retrieval system
using this MSDC database and one provided by Professor
Biemann at MIT.
As the computer system developed (and this will be
discussed in detail later in this chapter), it became
clear there were a number of problems with the database,
both in quantity and quality. It is likely that these
issues were noticeable only due to the fact the database
was actually being used every day in an online system by
practicing mass spectrometrists. This
computer retrieval project led the NIH, and in later
years the US Environmental Protection Agency (EPA),
along with the US National Bureau of Standards (NBS),
and the US Food and Drug Administration (FDA) to begin
to work with the MSDC in enlarging the database and
bringing quality assurance and quality control into the
database activity project [1].
In addition to these Anglo-American efforts, a second
major effort was initiated by Stenhagen and Abrahamsson
in Sweden. This effort was later joined by McLafferty.
When Stenhagen, and then later Abrahamsson, died,
McLafferty took over this database development and
maintenance. Today this database is known as the Mass
Spectrometry Registry and is distributed by the
publisher, John Wiley and Sons.
Other activities in database development have taken
place at the Atomic Energy Laboratory in Grenoble,
France under the direction of Cornu and Massot. A small
database of 2000 mass spectra of chemicals of biological
interest was compiled by Markey. Cairns and Jacobson of
the US FDA compiled a database of some 2000 mass spectra
of pesticides and industrial chemicals. Sorenson at
Agriculture Canada compiled 300 mass spectra of drugs
used in horse racing. The API Project 44 collection
continued for many years under the direction of
Zwolinski at Texas A&M as part of the Thermodynamics
Research Center data collection activities. Shackelford
at the US EPA collected a database of some 1500 mass
spectra of pollutants which had been found in water
analyses. Ryhage at the Karolinska Institute in Sweden
collected about 2500 mass spectra of chemicals studied
in research activities in this research center in
Stockholm.
The list goes on, but by now the reader should easily
see that mass spectrometry data collection was very much
a cottage industry for the most part, with just two
major efforts. These two efforts were the US-UK group
and the McLafferty group. Over the past decade this has
remained the case, and today there are two major
collections of mass spectral databases in the world. Of
course there are many mass spectral database collections
which can be found in industrial labs throughout the
world. However, these collections, a number of which are
reported to contain over 100,000 spectra of different
compounds (such as would be expected in the flavor and
fragrance industry) will never see the light of day,
owing to the need for corporate secrecy. Because of
intense concern over trade secrets and competition in
many industries, corporate lawyers see no reason to be
generous and donate useful mass spectral data to the
scientific community.
The first of these two major efforts to be discussed in
some detail is the NBS mass spectral database, which
contains some 43,000 mass spectra of an equal number of
chemicals. Only one spectrum per compound is to be found
in this database. All multiple (but not necessarily exact
duplicate) spectra have been removed, by a process
described later. All labeled compounds have been removed
from the database, so no spectra of deuterated and
similarly labeled compounds will be found. Each spectrum
has had a Chemical Abstracts Service (CAS) Registry number
assigned to the chemical which produced the spectrum. Each
chemical has a CAS name, along with as many other names as
could be found, both formal (i.e., IUPAC) and trivial,
English and foreign language (but not Japanese, Arabic,
Cyrillic, and so forth). Each spectrum has a quality index
(QI), which ranges from 0 to 999, calculated and assigned to the
spectrum. When a new spectrum for the same chemical is
received, a QI is calculated and compared to the one
already in the file. If the new QI is higher than the
current QI, the new spectrum replaces the current
spectrum in the database, and the current spectrum is
placed in an archive file. This archive file, which is not
available at present to users, contains well over
75,000 spectra, and includes all the multiple copies of
spectra and all labeled spectra [2].
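The replacement rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NBS program: the names Spectrum, Library, and submit are hypothetical, the registry number used is a placeholder, and sending a lower-scoring newcomer to the archive is an assumption that fits the statement that the archive holds all the multiple copies.

```python
from dataclasses import dataclass, field


@dataclass
class Spectrum:
    cas_number: str      # CAS Registry number of the compound (placeholder values here)
    quality_index: int   # QI as computed elsewhere
    peaks: dict          # m/z -> relative intensity


@dataclass
class Library:
    working: dict = field(default_factory=dict)   # cas_number -> best Spectrum so far
    archive: list = field(default_factory=list)   # every spectrum that lost out

    def submit(self, new: Spectrum) -> None:
        """Keep one spectrum per compound; archive the loser."""
        current = self.working.get(new.cas_number)
        if current is None:
            self.working[new.cas_number] = new
        elif new.quality_index > current.quality_index:
            # New spectrum is better: it replaces the current one.
            self.working[new.cas_number] = new
            self.archive.append(current)
        else:
            # Assumption: a newcomer that is not better goes to the archive.
            self.archive.append(new)


lib = Library()
for qi in (400, 650, 500):
    lib.submit(Spectrum("CAS-0001", qi, {224: 100.0}))
print(lib.working["CAS-0001"].quality_index, len(lib.archive))
```

With a tie in QI this sketch keeps the spectrum already on file, since the text requires the new QI to be higher before a replacement is made.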
The Stenhagen, Abrahamsson, and McLafferty mass spectral
database, hereafter called the Wiley database, is similar
to the NBS database in many ways. The database is larger,
containing some 80,000 spectra. The main reason for this is
that the Wiley collection includes multiple copies of the
mass spectrum of a chemical when the spectra are different.
(Different as determined qualitatively by the author of the
database.) The database also includes the spectra of
labeled chemicals, which have been left out of the NBS
collection. The Wiley collection uses the Wiswesser Line
Notation (WLN) as the method of trying to uniquely identify
the structure of the chemical associated with each
spectrum. I use the word "try" advisedly, since WLN is not
a canonical notation.
A canonical notation is one that will produce a unique
structure from a given structure representation. A WLN
notation, used to represent a chemical structure, can and
does give rise to more than one structure. That is, two
different structures can and do have the same WLN. For
this reason modern structure representation systems no
longer use WLN as their primary structure representation.
Today, connection tables are used for structure
representation. It should be noted that while the NBS
database has a CAS Registry number for every entry, the
Wiley database does not. Somewhat over 2/3 of the Wiley
database has CAS Registry numbers. The situation for the
WLN in the Wiley collection is somewhat worse. There are
WLN structure notations for slightly over 50% of the
spectra in the database. The NBS database, which uses the
CAS Registry number (and associated connection table
structure record), does not contain the WLN, except as a
synonym along with other chemical names. In addition, the
Wiley collection has a QI for every spectrum, although
the method used to calculate the QI differs slightly from
the one used by the NBS project [3].
Before leaving the issue of databases, it is worthwhile to
mention a third database from the MSDC, which is their
eight peak spectra database. As the name implies, the
database is comprised of the eight largest peaks in each
spectrum, not the entire spectrum. (Of course, if a
spectrum consists of eight or fewer peaks, the spectrum in
the MSDC database will be the complete spectrum. Except for
ethane, water, and a few other very simple compounds, this
is not the case.) The 8 peak database from MSDC contains some
70,000 spectra, including duplicates, and is available
from the MSDC. The older MSDC complete spectra are also
available [4]. A summary of the electron impact (EI) mass
spectra databases is given in Table 1.
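The reduction of a full spectrum to its eight largest peaks is simple enough to sketch in Python. The function name eight_peak, the dictionary representation, and the sample intensities are assumptions made for illustration; how the MSDC breaks ties between equally intense peaks is not stated, so the lower m/z is preferred here purely for determinism.

```python
def eight_peak(spectrum, n=8):
    """Return the n most intense peaks as (m/z, intensity) pairs, ordered by m/z.

    `spectrum` maps m/z to relative intensity. A spectrum with n or fewer
    peaks comes back complete.
    """
    # Sort by descending intensity; ties go to the lower m/z (an assumption).
    strongest = sorted(spectrum.items(), key=lambda peak: (-peak[1], peak[0]))[:n]
    return sorted(strongest)


# Invented ten-peak spectrum, for illustration only.
full = {27: 30, 28: 100, 29: 20, 41: 5, 43: 60,
        55: 2, 57: 45, 71: 10, 85: 8, 100: 1}
print(eight_peak(full))
```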
QUALITY CONTROL/QUALITY EVALUATION
An obvious concern of the scientific community regarding
these mass spectral databases is the quality of the
spectra contained in the files. As the US National Bureau
of Standards (NBS), Office of Standard Reference Data
(OSRD) was one of the early participants and sponsors of
one of the major database efforts in mass
spectrometry, this issue arose early. Methods were
quickly devised to control the quality of the chemical
nomenclature and structure associated with each spectrum.
The Chemical Abstracts Service (CAS) Registry number, a
sort of social security number for a chemical, was
accepted as the unique identifier, and the CAS
nomenclature used as the primary name. In the development
of the method or algorithm used to determine the quality
of a spectrum, a semi-qualitative method was devised, as
no absolute measure of the quality of a mass spectrum is
known [5,6].
In 1974 the US-UK group decided to remove redundant
or multiple copies of spectra from the file. This decision
was reached as it was felt by almost everyone that they
served little purpose and were taking up valuable storage
space and computer search time. The name(s) of every
compound in the file were sent to Chemical Abstracts Service
where, under contract to the US EPA, CAS identified the
CAS Registry number for a compound. The first step in the
process was to perform a simple name match. When this did
not succeed, the structure of the chemical was matched
against the structure in the CAS file of a few million
chemicals. If this second step failed, then it was
determined that the chemical was not in the CAS file
(which numbered some 4-5 million at the time), and a new
CAS Registry number was assigned to the chemical.
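The three-step registration procedure (name match, then structure match, then a new assignment) is a simple cascade, sketched below in Python. Everything here is hypothetical: the function register, the two lookup tables, the string used as a structure key, and the placeholder registry numbers are invented for illustration and say nothing about how CAS actually performs these matches.

```python
def register(names, structure_key, name_index, structure_index, next_number):
    """Return (registry_number, how_found) for one compound.

    names           -- every name on file for the compound
    structure_key   -- hypothetical unique string standing in for the structure
    name_index      -- dict: lower-cased name -> registry number
    structure_index -- dict: structure key -> registry number
    next_number     -- callable that issues a fresh registry number
    """
    # Step 1: simple name match.
    for name in names:
        number = name_index.get(name.strip().lower())
        if number is not None:
            return number, "name match"
    # Step 2: structure match against the registry file.
    if structure_key in structure_index:
        return structure_index[structure_key], "structure match"
    # Step 3: not in the file, so a new number is assigned and recorded.
    number = next_number()
    structure_index[structure_key] = number
    return number, "newly assigned"


fresh = iter(["NEW-1", "NEW-2"])
name_index = {"compound x": "REG-1"}
structure_index = {"structure-y": "REG-2"}
print(register(["Compound X"], "structure-x", name_index, structure_index,
               lambda: next(fresh)))
```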
When this CAS registration step was complete, the next
step was to devise a method to decide, in the cases where
several spectra existed for one compound, which was the
best one. The approach used was to draw on the experience
of practicing mass spectrometrists. As the mass
spectrometry of organic
compounds developed during the 1960's and early 1970's,
spectrometrists became familiar with the types of errors
that occur frequently in recorded mass spectra.
Responses ranging from modification of experimental
procedures to redesign of spectrometers were adopted to
eliminate or minimize these errors. The result is that a
conscientious analyst using a modern mass spectrometer
can produce mass spectra which rarely, if ever, contain
such errors. Thus the US EPA funded a project to develop
an algorithm which examines a mass spectrum for the
occurrence of such standard errors. The program computes
a number, which is called the Quality Index (QI), and
is a measure or indicator of the quality - in terms of
the absence of standard errors - of the spectrum.
The QI algorithm employs seven (7) quality factors (QF), each having a value between zero (0) and one (1).
Multiplication together of all these quality factors and
further multiplication of the product by 1000 leads to
the quality index (QI) for the spectrum. The quality
factors now being used by the NBS Office of Standard
Reference Data are:
QF1. The electron voltage
QF2. Peaks above the molecular weight
QF3. Illogical neutral losses
QF4. Isotopic abundance accuracy
QF5. The number of peaks in a spectrum
QF6. Lower mass limit of the spectrum
QF7. Sample Purity
QF8. Calibration date
QF9. Similarity Index of calibration mass spectrum
Details of the method for determining the QI from QF's
can be found elsewhere [6]. Only a few points will be noted
here. The first is that the NBS QI procedure uses these nine
(9) factors, whereas the McLafferty QI uses only the first
six (6) of these, and has added a seventh QF, which is
called the source of the spectrum.
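The arithmetic that turns quality factors into a QI can be shown in a short Python sketch. Only the rule itself (multiply the factors together, then multiply by 1000) comes from the text; the function name quality_index, the sample factor values, and the rounding to a whole number are assumptions, and the sketch does not try to reproduce how the published range of 0 to 999 is arrived at. How each individual factor is scored is described in reference [6] and is not attempted here.

```python
from math import prod


def quality_index(factors):
    """Multiply all quality factors together and scale by 1000.

    `factors` maps a factor label (e.g. "QF1") to a value in [0, 1].
    Rounding to the nearest integer is an assumption of this sketch.
    """
    for label, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must lie between 0 and 1, got {value}")
    return round(1000 * prod(factors.values()))


# Invented factor values for the nine NBS factors, for illustration only.
factors = {f"QF{i}": 1.0 for i in range(1, 10)}
factors["QF1"] = 0.5   # e.g. a penalty on the electron voltage factor
factors["QF5"] = 0.8   # e.g. a penalty on the number-of-peaks factor
print(quality_index(factors))

# Because the QI is a product, a single zero factor forces the QI to zero.
factors["QF6"] = 0.0
print(quality_index(factors))
```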
The second point is that the last three (3) QF's are based
upon experience gained in a US EPA contract
for obtaining new mass spectra. The cost of running some
1000 new spectra a year has been found to be almost $250
per spectrum. As much of this cost goes to acquiring and
purifying the sample ($61) and to lab overhead, which
includes calibration ($130), these additional QF's were
considered important enough to modify the original method
used to calculate the QI. As the Wiley effort does not
involve any activities in running new spectra, these QF's
were not added to their QI calculation.
The last point to be made is that QF9, which is the
quality of the reference spectrum, is a very important
factor for ensuring only the best data are added to the
database. To obtain this QF, the calibration spectrum is
stored at the time of calibration, and the
similarity between it and the standard library spectrum of
the compound [bis(pentafluorophenyl)phenylphosphine]
is computed by the Similarity Index program within the Mass
Spectral Search System (MSSS) [1] of the NIH/EPA Chemical
Information System (CIS). This number, which lies between
zero and one, becomes QF9, which is an indicator of
spectrometer performance.
All the quality factors are automatically calculated by
means of a computer program which also computes the Quality
Index (QI) for each spectrum. Then, whenever spectra
associated with the same CAS Registry number are
encountered, the one with the highest QI is retained, and
the remaining spectra are put into an archive file. When
this process was completed with one version of the
database, about 22% of the entire database was consigned to
the archive file. For the spectra in the current NBS
database, the average QI is slightly over 500. With spectra
such as these, both QF8 and QF9, which relate to
calibration of the [...]
or slightly under 4% of the entire working database had a
QI of zero. When the 1353 spectra were examined in some
detail, the reasons for the assignment of a zero for the QI
emerged from a few of the Quality Factors (QF). The QF's
which most often caused the QI to be zero (remember, the QI
is a multiplicative result, so any QF which is zero
automatically means that QI will be zero) were the lowest
mass reported and the impurity peaks greater than the
molecular ion. While the lowest mass value has no real
bearing on the correctness of a spectrum, it does bear very
heavily on the usefulness of the spectrum, and the constant
need to remind scientists to report complete data.
Scientists are tending to report less and less raw data
(with the clear approval of journal editors who seem more
concerned about economics (printing costs) than science),
and more often are beginning to select the data which
support their explanation or interpretation of the
experimental data. This is clearly not in the best interest
of the scientific community. Hopefully, with the QF, this will
occur less in mass spectral data reporting in the future.
Again, the reader should remember the QI is not really an
indicator that a spectrum is good. Rather it is an
indication of the problems with a spectrum. The QI is
more reliable in telling a scientist that a spectrum is
poor, which means from the above QF's that the spectrum
is incorrect and/or lacking in certain areas.
MASS SPECTRAL SEARCH SYSTEMS
The mass spectral library search, started in the 1960's,
continues to be of interest to many research groups
around the world. There are probably dozens of different
approaches to mass spectral library searching, and a few
hundred papers written on the subject. While their
numbers have been decreasing in the last few years, the
work is not likely to end. Minor enhancements and slight
refinements will continue, and hence additional
publications will result.
It is not the purpose of this chapter to go over this
long history of library search systems; rather, the aim is
to highlight the major ones being used today, and
briefly discuss some of the more recent research results.
The reader is referred elsewhere for more detailed
presentations and reviews of library search systems
[7,8].
There are four main computer systems in which library search
systems are to be found. They are:
1. Large time-sharing systems
2. Dedicated lab or mini-computer systems
3. Instrument manufacturer computer data systems
4. Microcomputer library search systems
One can argue that 2 and 4 are the same, or will be, as
microcomputers grow in size, speed, and capability, and as
large disk storage becomes readily available. Thus only
three main areas will be covered.
For the large time-sharing systems there is really only
one system, the Mass Spectral Search System (MSSS) which is
part of a larger Chemical Information System (CIS), developed
by the US Government from about 1970 to 1984 [9]. The MSSS was
first made available to the public in the fall of 1972, when
it was introduced to the mass spectrometry community at the
International Mass Spectrometry meeting in Scotland, via the
General Electric (GE) computer system and corresponding GE
telecommunications network. The original MSSS was sponsored
by the UK Government, which later bowed out and the US
Government took over the running and support of this system.
The MSSS has the most extensive list of search and plotting
options of any mass spectral search system on any computer
system. It was meant to serve as broad a community of mass
spectrometrists as possible. Since the system used a large
time-sharing computer, disk space was not the issue it was on
lab computer or instrument data systems. Thus, considerable
capabilities could easily be built and made available to the
user. However, as lab computers grew in size and capability,
and their costs began to decrease, the MSSS became less
powerful in a relative sense. Furthermore, the decision of
the US Government, via the National Bureau of Standards -
Office of Standard Reference Data (as discussed above), to
distribute the database to mass spectrometer manufacturers,
led to a considerable decline in the usage of the MSSS.
Notwithstanding all of these developments, coupled with the
NBS publishing what is now a six (6) volume set of books of
the mass spectral database, the MSSS is alive and running,
and being used by many scientists on a regular daily basis.
The MSSS main search is a variation of a procedure devised by
Hertz, Hites and Biemann [10] and redesigned by Heller [1]
for use in an online time-sharing system.
Searches through the MSSS database can be carried out in a
number of ways. With the mass spectrum of an unknown
substance in hand, the search can be conducted
interactively, as is shown in Figure 1. In this search the
user finds that 91 database spectra have a peak (minimum
intensity 60%, maximum intensity 100%) at an m/z value of
224. When this subset is examined for spectra containing a
peak at m/z 207 with intensity of between 80 and 100%, only
3 spectra are found. The entering of a third peak, at an m/z
value of 73 (with an intensity between 10 and 40%) narrows
the search down to just 1 answer, which is then printed
out. In the example shown, the answer "2,3,6 trichloro
benzoic acid" is shown with a number of synonyms used in
naming this chemical, as well as other identifying
information. If there still had been a large number of
answers after entering the three peaks used in this
example, the search could have been reduced further to a
manageable number of spectra by entering further peaks. In
addition, the database can be examined for all occurrences
of a specific molecular weight or a partial or complete
molecular formula. Combinations of these properties can
also be used in searches. Thus, all compounds containing,
for example, five chlorines and whose mass spectra have a
base peak at a particular m/z value can be identified.
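The interactive peak search described above amounts to filtering the candidate set once per peak entered. The Python sketch below illustrates only that narrowing process; the function peak_search, the record layout, and the four toy spectra are invented for illustration and are not MSSS code or real library data. The three m/z values and intensity windows are those of the example in the text.

```python
def peak_search(candidates, mz, low, high):
    """Keep the spectra that have a peak at `mz` with intensity in [low, high]."""
    hits = []
    for entry in candidates:
        intensity = entry["peaks"].get(mz)   # None when the peak is absent
        if intensity is not None and low <= intensity <= high:
            hits.append(entry)
    return hits


# Toy library: names and intensities are made up.
library = [
    {"name": "compound A", "peaks": {224: 65, 207: 100, 73: 25}},
    {"name": "compound B", "peaks": {224: 100, 207: 85, 73: 5}},
    {"name": "compound C", "peaks": {224: 80, 207: 40}},
    {"name": "compound D", "peaks": {207: 100, 73: 30}},
]

subset = library
for mz, low, high in [(224, 60, 100), (207, 80, 100), (73, 10, 40)]:
    subset = peak_search(subset, mz, low, high)
    print(f"m/z {mz}: {len(subset)} spectra remain")
print([entry["name"] for entry in subset])
```

Each pass works on the survivors of the previous one, which is why the number of answers can only stay the same or shrink as further peaks are entered.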
In contrast to these interactive searches, which are of
little appeal to those with large numbers of searches to
carry out, there are available two batch-type searches
which accept the complete spectrum of the unknown substance
and examine all spectra in the file sequentially to find
the best fits. These are the KB (forward search) [10] and
PBM (reverse search) search algorithms [11,12,13]. Spectra
can be entered from a teletype; but in a more powerful
approach, a user's data system can be connected to the
network and the unknown spectra down-loaded into the
network computer for searching. An example of a Biemann
(KB) search is given in Figure 2. The search is for dioxin,
and the data entered are underlined in the figure. The
results of the search are three spectra with similarity
values greater than 0.18. Of the three, the first, which is
dioxin, has the highest similarity index (SI). Once an
identification has been made and the name and CAS Registry
Number of the database compound are reported to the user,
the database spectrum can be listed or, if a CRT terminal
is being used, plotted, to facilitate direct comparison of
the unknown and standard spectra.
Before leaving the area of mass spectral search systems, one
should note that today virtually every mass spectrometer
which runs electron impact (EI) spectra has both a search
program and a database provided as part of the system
package. The search programs are usually variations of the
Biemann and McLafferty PBM algorithm search routines. The
database is usually the NBS database, although not usually
the latest version. The reason for the database not being
the latest version is twofold. Firstly, manufacturers do not
update a system disk that often. Secondly, and more
critical, not many disk systems installed on old, and even
new, computer systems have sufficient disk capacity for
the entire library of over 40,000 spectra. Even when the
entire
library is installed on a manufacturer's data system, one
soon discovers that some of the original database is
missing. In particular, the information normally left out,
owing to space limitations on the disk, includes many (if not
all) of the chemical names and synonyms, and details of the
source of the spectrum. Thus going to an online system for
complete details may be necessary. (As for incomplete data,
it is useful to mention that the six volume set of books
published by the US Government Printing Office does not
have spectral source information, such as is shown in
Figure 3.) An example of a typical plot of a spectrum is
given in Figure 3, while a sample page from the six volume
set of books is given in Figure 4. In Figure 3, the
spectrum of 2,3,6 trichloro benzoic acid (the result of the
search in Figure 1) is plotted out on an expanded scale.
Recent Activities in Library Searching
One important aspect of library searching which is
attracting continued attention is that of how to analyze
and search for compounds found in mixtures. The PBM
method, mentioned before, is one good approach, although it
has problems when some components are found in large
amounts, and others are found in only much smaller or trace
amounts. PBM does best when the chemicals in the mixture
are of roughly equal proportions, which is not always the
case in real life problems, such as dump sites and polluted
waters. The most recent of the McLafferty papers
(stretching over a decade) in search of fine tuning the
ultimate search program, is one which deals with further
improvements in the statistical reliability of predicted
matches [13]. The result of this latest work indicates that
they are able to provide a quantitative measure of the
predicted reliability of a given spectral match. In
addition work was presented which improved the procedures
for taking into account the variation in peak abundances
caused by mass discrimination and change in sample
concentration often found during GC runs.
A recent article by scientists at an EPA lab presents a
system of computer programs for recognizing impure or mixed
spectra and automatically subtracting reference mass
spectra of a chemical in the mixture from the spectrum of
the mixture [14]. This spectrum subtraction would have
considerable use in enhancing the ability of computer
library search programs to match components of a
multicomponent mixture correctly, given the problems of
current programs, such as the PBM system mentioned above. In
addition a set of quality factors was used to help
evaluate the overall validity of the spectrum library
match.
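The idea of subtracting a reference spectrum from a mixture spectrum can be illustrated with a very small Python sketch. It is not the method of reference [14], whose details are not given here. The scaling rule used (the largest multiple of the reference that leaves no negative residual) is an assumption chosen for simplicity, and the function name and sample spectra are invented.

```python
def subtract_reference(mixture, reference):
    """Subtract a scaled reference spectrum from a mixture spectrum.

    Both arguments map m/z to intensity. Returns (scale, residual), where
    `scale` is the largest multiple of the reference that never exceeds the
    mixture at any reference peak (an assumed rule, see the lead-in).
    """
    ratios = [mixture.get(mz, 0.0) / intensity
              for mz, intensity in reference.items() if intensity > 0]
    if not ratios:
        raise ValueError("reference spectrum has no peaks")
    scale = min(ratios)
    residual = {}
    for mz, intensity in mixture.items():
        left = intensity - scale * reference.get(mz, 0.0)
        if left > 1e-9:   # drop peaks fully accounted for by the reference
            residual[mz] = left
    return scale, residual


# Invented spectra, for illustration only.
mixture = {50: 30.0, 77: 100.0, 105: 40.0}
reference = {50: 20.0, 77: 100.0}
print(subtract_reference(mixture, reference))
```

The residual could then be submitted to a library search in place of the original mixed spectrum. A reference peak missing from the mixture drives the scale to zero in this sketch, which is one reason a practical program would need a more tolerant rule.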
Lastly, a study by a research group at Boston University
has proposed a method to evaluate library searching
systems. The evaluation procedure is called Quantitative
Evaluation of Library Searching (QELS) [15]. The method
compares hit-lists obtained with trial conditions (e.g.,
compressed spectra) to hit-lists from a successful search
system. While this approach has been used for infrared (IR)
library searching, it should be valid for mass spectral
library searching, and it would be of considerable use to
the practicing spectroscopist if such an evaluation method
were available, particularly one developed by a group which
has no vested interest in existing search methods.
SUMMARY
It is hoped that the reader now has sufficient background
to understand the nature and content of the mass spectral
databases which are now available, whether on a mass
spectrometry data system, a magnetic tape of spectral data,
or an online system. The most important point to take from
this chapter is that the mass spectral databases are small
(60,000 spectra out of over 7,000,000 reported chemicals is
well under 1% of known chemicals) and not of the highest
quality. However, what you have read about here is what you
can get, so it is best to learn to work with it. A critical
point, which all scientists should remember, but most often
forget, is that structure elucidation is not founded upon
one technique. Mass spectral data are very valuable, but
not absolute and not unique. Other confirming evidence,
whether it be chemical or spectral (e.g., IR, CNMR, and so
forth) is absolutely necessary for good science. One reason
there continues to be further work on library search
systems, to
fine tune them and squeeze out the last drop of information,
is simple. Mass spectral data alone are not enough, but some
still try to make it so. Good scientists use all the tools
that are available to solve a problem. In most cases this
means more than mass spectrometry.
REFERENCES
1. S. R. Heller, Anal. Chem., 44, 1951 (1972); G. W. A.
Milne, S. R. Heller, R. S. Heller, and D. P. Martinsen, Adv.
Mass Spectrom., 8B, 1578 (1980); S. R. Heller, Kemia-Kemi, #1,
pages 15-16 (1984).
2. The NIH/EPA/MSDC database is available for lease on
computer tape from the U.S. National Bureau of Standards
(NBS), Office of Standard Reference Data, Physics Building,
Room A-320, Gaithersburg, Maryland 20899, USA. (Telephone
301-921-2228). The database is also available in printed form
(currently six volumes and an index volume). The six volumes
are available from the US Government Printing Office,
Washington, DC 20402. The MS books are available as the first
four volume set (stock number 003-003-01987-9), Supplement
Number 1 (stock number 003-003-02268-3), and Supplement
Number 2 (stock number 003-003-02514-3). For prices and
details on how to order, please contact the Government
Printing Office.
3. The Wiley/NBS Mass Spectral Database is available from
John Wiley & Sons, Electronic Publishing Division, 605 Third
Avenue, New York, NY 10158.
4. Mass Spectrometry Data Centre, UKCIS, The University,
Nottingham NG7 2RD, UK.
5. J. G. Dillard, S. R. Heller, F. W. McLafferty, G. W. A.
Milne and R. Venkataraghavan, Org. Mass Spectrom., 16, 48-49 (1981).
6. G. W. A. Milne, W. L. Budde, S. R. Heller, D. P. Martinsen
and R. G. Oldham, Org. Mass Spectrom., 17, 547-552 (1982).
7. D. P. Martinsen, Appl. Spectrosc., 35, 255 (1981).
8. (Finnigan MAT) SPECTRA, Volume 10, Number 1, 1984.
9. S. R. Heller, J. Info. Processing and Management, 27, 19
(1984); S. R. Heller, Drexel Library Quarterly, 18, #3 & 4,
39 (1982); G. W. A. Milne, R. Potenzone Jr., and S. R.
Heller, Science, 215, 371 (1982).
10. H. S. Hertz, R. A. Hites, and K. Biemann, Anal. 681
(1971).
11. F. W. McLafferty, R. H. Hertel, and R. D. Villwock,
Spectrom., 9, 690 (1974).
12. G. M. Pesyna, R. Venkataraghavan, H. E. Dayringer,
McLafferty, Anal. Chem., 51, 1945 (1979).
13. B. L. Atwater, D. B. Stauffer, F. W. McLafferty,
and D. W. Peterson, Anal. Chem., 57, 899 (1985).
14. W. M. Shackelford and D. M. Cline, Anal. Chim. Acta, 164,
251 (1985).
15. J. R. Hallowell and M. Delaney, Trends Anal. Chem., 4,
#3, IV-VII, (1985).
TITLES FOR TABLE AND FIGURES
Table 1: Summary of EI Mass Spectral Databases
Figure 1: Typical MSSS PEAK search.
Figure 2: Typical Search using the Biemann Search Procedure.
Figure 3: Plot of Compound identified in PEAK search in
Figure 1.
Figure 4: Sample page from the six volume set of mass
spectral books.