Mass Spectrometry Databases and Search Systems

Stephen R. Heller

Agricultural Research Service

US Department of Agriculture

Beltsville, MD 20705 USA

This chapter will provide the reader with a discussion of mass spectrometry databases and some examples of library search systems used in mass spectrometry.

The development of mass spectral databases started in the 1940's with the American Petroleum Institute (API) Project 44 activities. The reason that mass spectrometry database activity goes back so far is, no doubt, the nature of mass spectral data. The mass spectrum of a chemical consists of data which are ideally suited for representation and manipulation in digital form. Compared to Infrared (IR) and Nuclear Magnetic Resonance (NMR) spectral data, mass spectra are the ultimate in simplicity: just peaks and intensities. However, this is not to say the data are simple or easy to understand, interpret, or correlate.

While the API Project 44 continued over the years, other groups began their own mass spectral data collections. It was not until 1965 that the British Government began funding its Atomic Weapons Research Establishment (AWRE) at Aldermaston for the purpose of creating a world-wide database of mass spectra. This project funded a group which became known as the Mass Spectrometry Data Centre (MSDC) at Aldermaston. A few years later, the US National Institutes of Health (NIH) Laboratory of Chemistry, which was heavily involved in mass spectrometry, began the development of a computer-based library retrieval system using this MSDC database and one provided by Professor Biemann at MIT.

As the computer system developed (and this will be discussed in detail later in this chapter), it became clear there were a number of problems with the database, both in quantity and quality. It is likely that these issues were noticeable only because the database was actually being used every day in an online system by practicing mass spectrometrists. This computer retrieval project led the NIH, and in later years the US Environmental Protection Agency (EPA), along with the US National Bureau of Standards (NBS) and the US Food and Drug Administration (FDA), to begin working with the MSDC to enlarge the database and to bring quality assurance and quality control into the database activity [1].

In addition to these Anglo-American efforts, a second major effort was initiated by Stenhagen and Abrahamsson in Sweden. This effort was later joined by McLafferty. When Stenhagen, and later Abrahamsson, died, McLafferty took over the development and maintenance of this database. Today this database is known as the Mass Spectrometry Registry and is distributed by the publisher John Wiley and Sons.

Other activities in database development have taken place at the Atomic Energy Laboratory in Grenoble, France under the direction of Cornu and Massot. A small database of 2000 mass spectra of chemicals of biological interest was compiled by Markey. Cairns and Jacobson of the US FDA compiled a database of some 2000 mass spectra of pesticides and industrial chemicals. Sorenson at Agriculture Canada compiled 300 mass spectra of drugs used in horse racing. The API Project 44 collection continued for many years under the direction of Zwolinski at Texas A&M as part of the Thermodynamics Research Center data collection activities. Shackelford at the US EPA collected a database of some 1500 mass spectra of pollutants which had been found in water analyses. Ryhage at the Karolinska Institute in Stockholm, Sweden collected about 2500 mass spectra of chemicals studied in research activities at that institute.

The list goes on, but by now the reader should easily see that mass spectrometry data collection was for the most part very much a cottage industry, with just two major efforts. These two efforts were the US-UK group and the McLafferty group. Over the past decade this has remained the case, and today there are two major mass spectral databases in the world. Of course there are many mass spectral database collections which can be found in industrial labs throughout the world. However, these collections, a number of which are reported to contain over 100,000 spectra of different compounds (such as would be expected in the flavor and fragrance industry), will never see the light of day, owing to the need for corporate secrecy. Because of intense concern over trade secrets and competition in many industries, corporate lawyers see no reason to be generous and donate useful mass spectral data to the scientific community.

The first of these two major efforts to be discussed in some detail is the NBS mass spectral database, which contains some 43,000 mass spectra of an equal number of chemicals. Only one spectrum per compound is to be found in this database. All multiple (but not necessarily exact duplicate) spectra have been removed by a process described later. All labeled compounds have been removed from the database, so no spectra of deuterated or other isotopically labeled compounds will be found. Each spectrum has had a Chemical Abstracts Service (CAS) Registry number assigned to the chemical which produced the spectrum. Each chemical has a CAS name, as well as as many other names as could be found, both formal (e.g., IUPAC) and trivial, in English and in foreign languages (but not Japanese, Arabic, Cyrillic, and so forth). Each spectrum has a quality index (QI), which ranges from 0 to 999, calculated and assigned to the spectrum. When a new spectrum for the same chemical is received, a QI is calculated and compared to the one already in the file. If the new QI is higher than the current QI, the new spectrum replaces the current spectrum in the database, and the current spectrum is placed in an archive file. This archive file, which is not available to users at present, contains well over 75,000 spectra and includes all the multiple copies of spectra and all labeled spectra [2].
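The replacement rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NBS software: the `Spectrum` record, the `add_spectrum` function, the placeholder identifier "CAS-0001", and the choice to keep the current spectrum when two QI values tie are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Spectrum:
    cas_rn: str   # CAS Registry number of the compound (placeholder values below)
    qi: int       # quality index, 0 to 999
    source: str   # hypothetical label for the contributing laboratory


def add_spectrum(database, archive, new):
    """Keep one spectrum per CAS Registry number; archive every other copy."""
    current = database.get(new.cas_rn)
    if current is None:
        database[new.cas_rn] = new        # first spectrum for this compound
    elif new.qi > current.qi:
        database[new.cas_rn] = new        # higher QI replaces the current entry
        archive.append(current)
    else:
        archive.append(new)               # assumption: ties keep the current entry


database, archive = {}, []
add_spectrum(database, archive, Spectrum("CAS-0001", 412, "lab A"))
add_spectrum(database, archive, Spectrum("CAS-0001", 655, "lab B"))
add_spectrum(database, archive, Spectrum("CAS-0001", 300, "lab C"))
print(database["CAS-0001"].source, len(archive))
```

After the three submissions, the working database holds the single highest-QI spectrum and the archive holds the other two copies.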

The Stenhagen, Abrahamsson, and McLafferty mass spectral database, hereafter called the Wiley database, is similar to the NBS database in many ways. The database is larger, containing some 80,000 spectra. The main reason for this is that the Wiley collection includes multiple copies of the mass spectrum of a chemical when the spectra are different. (Different as determined qualitatively by the author of the database.) The database also includes the spectra of labeled chemicals, which have been left out of the NBS collection. The Wiley collection uses the Wiswesser Line Notation (WLN) as the method of trying to uniquely identify the structure of the chemical associated with each spectrum. I use the word "trying" very deliberately, since WLN is not a canonical notation.

A canonical notation is one that will produce a unique structure from a given structure representation. A WLN notation, used to represent a chemical structure, can and does give rise to more than one structure. That is, two different structures can and do have the same WLN. For this reason modern structure representation systems no longer use WLN as their primary structure representation. Today, connection tables are used for structure representation. It should be noted that while the NBS database has a CAS Registry number for every entry, the Wiley database does not. Somewhat over two-thirds of the Wiley database has CAS Registry numbers. The situation for the WLN in the Wiley collection is somewhat worse: there are WLN structure notations for slightly over 50% of the spectra in the database. The NBS database, which uses the CAS Registry number (and associated connection table structure record), does not contain the WLN, except as a synonym along with other chemical names. In addition, the Wiley collection has a QI for every spectrum, although the method used to calculate the QI differs slightly from the one used by the NBS project [3].

Before leaving the issue of databases, it is worthwhile to mention a third database from the MSDC, which is their eight peak spectra database. As the name implies, the database is comprised of the eight largest peaks in each spectrum, not the entire spectrum. (Of course, if a spectrum consists of eight or fewer peaks, the spectrum in the MSDC database will be the complete spectrum. Besides ethane, water, and a few other very simple compounds, this is not the case.) The eight peak database from MSDC contains some 70,000 spectra, including duplicates, and is available from the MSDC. The older MSDC complete spectra are also available [4]. A summary of the electron impact (EI) mass spectra databases is given in Table 1.
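The abridgment to the eight largest peaks can be illustrated with a short Python sketch. It is not the MSDC procedure: the function name `eight_peak`, the representation of a spectrum as a dictionary of m/z to relative intensity, the toy spectrum, and the handling of equal intensities (earlier entries win) are assumptions made for the example.

```python
def eight_peak(spectrum, n=8):
    """Return the n most intense peaks of a spectrum, listed in m/z order.

    spectrum: dict mapping m/z to relative intensity.
    A spectrum with n or fewer peaks is returned complete.
    """
    # Sort by intensity, largest first, and keep the first n peaks.
    top = sorted(spectrum.items(), key=lambda peak: peak[1], reverse=True)[:n]
    # Put the retained peaks back into ascending m/z order.
    return sorted(top)


# Hypothetical ten-peak spectrum, for illustration only.
example = {27: 12, 28: 35, 39: 20, 41: 55, 43: 100,
           55: 8, 57: 70, 71: 30, 85: 15, 99: 5}
abridged = eight_peak(example)
print(abridged)
```

The two weakest peaks (m/z 55 and 99 in this toy spectrum) are dropped, which is exactly the information an eight peak entry loses relative to the complete spectrum.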

QUALITY CONTROL/QUALITY EVALUATION

An obvious concern of the scientific community regarding these mass spectral databases is the quality of the spectra contained in the files. As the US National Bureau of Standards (NBS) Office of Standard Reference Data (OSRD) was one of the early participants in and sponsors of one of the major database efforts in mass spectrometry, this issue arose early. Methods were quickly devised to control the quality of the chemical nomenclature and structure associated with each spectrum. The Chemical Abstracts Service (CAS) Registry number, a sort of social security number for a chemical, was accepted as the unique identifier, and the CAS nomenclature was used as the primary name. For determining the quality of a spectrum, a semi-qualitative method was devised, as no absolute measurement of the quality of a mass spectrum is known [5,6].

In 1974 the US-UK group decided to remove redundant or multiple copies of spectra from the file. This decision was reached because almost everyone felt that they served little purpose and were taking up valuable storage space and computer search time. The name(s) of every compound in the file were sent to Chemical Abstracts Service where, under contract to the US EPA, CAS identified the CAS Registry number for each compound. The first step in the process was to perform a simple name match. When this did not succeed, the structure of the chemical was matched against the structures in the CAS file of a few million chemicals. If this second step failed, then it was determined that the chemical was not in the CAS file (which numbered some 4-5 million at the time), and a new CAS Registry number was assigned to the chemical.

When this CAS registration step was complete, the next step was to devise a method to decide, in the many cases where several spectra of one compound existed, which was the best one. The approach taken was to draw on the experience of practicing mass spectrometrists. As the mass spectrometry of organic compounds developed during the 1960's and early 1970's, spectrometrists became familiar with the types of errors that occur frequently in recorded mass spectra. Responses ranging from modification of experimental procedures to redesign of spectrometers were adopted to eliminate or minimize these errors. The result is that a conscientious analyst using a modern mass spectrometer can produce mass spectra which rarely, if ever, contain such errors. Thus the US EPA funded a project to develop an algorithm which examines a mass spectrum for the occurrence of such standard errors. The program computes a number, called the Quality Index (QI), which is a measure or indicator of the quality of the spectrum, in terms of the absence of standard errors.

The QI algorithm employs seven (7) quality factors (QF), each having a value between zero (0) and one (1). Multiplication together of all these quality factors and further multiplication of the product by 1000 leads to the quality index (QI) for the spectrum. The quality factors now being used by the NBS Office of Standard Reference Data are:

QF1. The electron voltage

QF2. Peaks above the molecular weight

QF3. Illogical neutral losses

QF4. Isotopic abundance accuracy

QF5. The number of peaks in a spectrum

QF6. Lower mass limit of the spectrum

QF7. Sample Purity

QF8. Calibration date

QF9. Similarity Index of calibration mass spectrum
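The arithmetic that turns quality factors into a quality index can be sketched as follows. This is a minimal Python illustration, not the NBS or Wiley program: the function name `quality_index`, the rounding to the nearest integer, and the cap at 999 (chosen only so the result stays in the 0 to 999 range quoted earlier) are assumptions, and the factor values are invented.

```python
import math


def quality_index(factors):
    """Multiply the quality factors together and scale the product by 1000."""
    if any(not 0.0 <= f <= 1.0 for f in factors):
        raise ValueError("each quality factor must lie between 0 and 1")
    product = math.prod(factors)
    # Assumption: round to an integer and cap at 999 to stay in the 0-999 range.
    return min(999, round(1000 * product))


# Hypothetical values for QF1 through QF9, for illustration only.
qfs = [1.0, 0.9, 1.0, 0.8, 1.0, 1.0, 0.75, 1.0, 0.95]
qi = quality_index(qfs)

# Because the QI is a product, a single factor of zero forces the QI to zero.
qi_zero = quality_index([1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
print(qi, qi_zero)
```

The multiplicative form means that one severe defect cannot be offset by good scores elsewhere, which is why a single zero factor is enough to give a spectrum a QI of zero.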

Details of the method for determining the QI from the QFs can be found elsewhere [6]. Only a few points will be noted here. The first is that the NBS QI procedure uses these nine (9) factors, whereas the McLafferty QI uses only the first six (6) of these and has added a seventh QF, which is called the source of the spectrum.

The second point is that the last three (3) QFs are based upon experience gained by the US EPA in developing a contract for obtaining new mass spectra. The cost of running some 1000 new spectra a year has been found to be almost $250 per spectrum. As much of this cost goes to acquiring and purifying the sample ($61) and to lab overhead, which includes calibration ($130), these additional QFs were considered important enough to modify the original method used to calculate the QI. As the Wiley effort does not involve running new spectra, these QFs were not added to its QI calculation.

The last point to be made is that QF9, which reflects the quality of the reference spectrum, is a very important factor in ensuring that only the best data are added to the database. To obtain this QF, the calibration spectrum is stored at the time of calibration, and the similarity between it and the standard library spectrum of the compound, bis(pentafluorophenyl)phenylphosphine, is computed by the Similarity Index program within the Mass Spectral Search System (MSSS) [1] of the NIH/EPA Chemical Information System (CIS). This number, which lies between zero and one, becomes QF9, an indicator of spectrometer performance.

All the quality factors are automatically calculated by means of a computer program which also computes the Quality Index (QI) for each spectrum. Then, whenever spectra associated with the same CAS Registry number are encountered, the one with the highest QI is retained, and the remaining spectra are put into an archive file. When this process was completed with one version of the database, about 22% of the entire database was consigned to the archive file. For the spectra in the current NBS database, the average QI is slightly over 500. With spectra such as these, both QF8 and QF9, which relate to calibration of the [...]

[...] or slightly under 4% of the entire working database had a QI of zero. When the 1353 spectra were examined in some detail, the reasons for the assignment of a zero for the QI emerged from a few of the Quality Factors (QF). The QFs which most often caused the QI to be zero (remember, the QI is a multiplicative result, so any QF which is zero automatically means that the QI will be zero) were the lowest mass reported and the impurity peaks above the molecular ion. While the lowest mass value has no real bearing on the correctness of a spectrum, it does bear very heavily on the usefulness of the spectrum, and on the constant need to remind scientists to report complete data. Scientists are tending to report less and less raw data (with the clear approval of journal editors, who seem more concerned about economics (printing costs) than science), and more often are beginning to select the data which support their explanation or interpretation of the experimental data. This is clearly not in the best interest of the scientific community. Hopefully, with the QF, it will occur less in mass spectral data reporting in the future.

Again, the reader should remember the QI is not really an indicator that a spectrum is good. Rather it is an indication of the problems with a spectrum. The QI is more reliable in telling a scientist that a spectrum is poor, which means from the above QF's that the spectrum is not correct and/or lacking in certain areas.

MASS SPECTRAL SEARCH SYSTEMS

The mass spectral library search, started in the 1960's, continues to be of interest to many research groups around the world. There are probably dozens of different approaches to mass spectral library searching, and a few hundred papers written on the subject. While their numbers have been decreasing in the last few years, the flow of papers is not likely to end. Minor enhancements and slight refinements will continue, and hence additional publications will result.

It is not the purpose of this chapter to go over this long history of library search systems. Rather, the aim is to highlight the major ones being used today and to briefly discuss some of the more recent research results. The reader is referred elsewhere for more detailed presentations and reviews of library search systems [7,8].

There are four main types of computer system on which library search systems are to be found. They are:

1. Large time-sharing systems

2. Dedicated lab or mini-computer systems

3. Instrument manufacturer computer data systems

4. Microcomputer library search systems

One can argue that 2 and 4 are the same, or will be, as microcomputers grow in size, speed, capability, and readily available large disk storage. Thus only three main areas will be covered.

For the large time-sharing systems there is really only one system, the Mass Spectral Search System (MSSS), which is part of a larger Chemical Information System (CIS) developed by the US Government from about 1970 to 1984 [9]. The MSSS was first made available to the public in the fall of 1972, when it was introduced to the mass spectrometry community at the International Mass Spectrometry meeting in Scotland, via the General Electric (GE) computer system and corresponding GE telecommunications network. The original MSSS was sponsored by the UK Government, which later bowed out, and the US Government took over the running and support of this system. The MSSS has the most extensive list of search and plotting options of any mass spectral search system on any computer system. It was meant to serve as broad a community of mass spectrometrists as possible. Since the system used a large time-sharing computer, disk space was not the issue it was on lab computers or instrument data systems. Thus, considerable capabilities could easily be built and made available to the user. However, as lab computers grew in size and capability, and their costs began to decrease, the MSSS became less powerful in a relative sense. Furthermore, the decision of the US Government, via the National Bureau of Standards Office of Standard Reference Data (as discussed above), to distribute the database to mass spectrometer manufacturers led to a considerable decline in the usage of the MSSS.

Notwithstanding all of these developments, coupled with the NBS publishing what is now a six (6) volume set of books of the mass spectral database, the MSSS is alive and running, and is being used by many scientists on a regular daily basis. The MSSS main search is a variation of a procedure devised by Hertz, Hites and Biemann [10] and redesigned by Heller [1] for use in an online time-sharing system.

Searches through the MSSS database can be carried out in a number of ways. With the mass spectrum of an unknown substance in hand, the search can be conducted interactively, as is shown in Figure 1. In this search the user finds that 91 database spectra have a peak (minimum intensity 60%, maximum intensity 100%) at an m/z value of 224. When this subset is examined for spectra containing a peak at m/z 207 with an intensity of between 80 and 100%, only 3 spectra are found. The entering of a third peak, at an m/z value of 73 (with an intensity between 10 and 40%), narrows the search down to just 1 answer, which is then printed out. In the example shown, the answer "2,3,6 trichloro benzoic acid" is shown with a number of synonyms used in naming this chemical, as well as other identifying information. If there still had been a large number of answers after entering the three peaks used in this example, the search could have been reduced further to a manageable number of spectra by entering further peaks. In addition, the database can be examined for all occurrences of a specific molecular weight or a partial or complete molecular formula. Combinations of these properties can also be used in searches. Thus, all compounds containing, for example, five chlorines and whose mass spectra have a base peak at a particular m/z value can be identified.
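The narrowing of the answer set peak by peak can be pictured as repeated filtering of a candidate list. The Python sketch below is an illustration only and makes no claim about how the MSSS is implemented: the function name `peak_search`, the dictionary layout, and the four toy spectra are hypothetical, while the m/z values and intensity windows are those quoted in the example above.

```python
def peak_search(candidates, mz, lo, hi):
    """Keep only spectra with a peak at mz whose intensity lies in [lo, hi]."""
    return {name: spectrum for name, spectrum in candidates.items()
            if mz in spectrum and lo <= spectrum[mz] <= hi}


# Hypothetical toy library: name -> {m/z: relative intensity in percent}.
library = {
    "compound A": {224: 100, 207: 95, 73: 25},
    "compound B": {224: 80, 207: 90, 73: 60},
    "compound C": {224: 65, 207: 30},
    "compound D": {207: 100, 73: 20},
}

hits = peak_search(library, 224, 60, 100)   # first peak: A, B and C remain
hits = peak_search(hits, 207, 80, 100)      # second peak: A and B remain
hits = peak_search(hits, 73, 10, 40)        # third peak: only A remains
print(list(hits))
```

Each call searches only the survivors of the previous one, so the user can stop as soon as the number of remaining spectra is small enough to inspect.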

In contrast to these interactive searches, which are of little appeal to those with large numbers of searches to carry out, there are available two batch-type searches which accept the complete spectrum of the unknown substance and examine all spectra in the file sequentially to find the best fits. These are the KB (forward search) [10] and PBM (reverse search) algorithms [11,12,13]. Spectra can be entered from a teletype, but in a more powerful approach, a user's data system can be connected to the network and the unknown spectra downloaded into the network computer for searching. An example of a Biemann (KB) search is given in Figure 2. The search is for dioxin, and the data entered are underlined in the figure. The result of the search is three spectra with similarity values greater than 0.18. Of the three, the first, which is dioxin, has the highest similarity index (SI). Once an identification has been made and the name and CAS Registry Number of the database compound are reported to the user, the database spectrum can be listed or, if a CRT terminal is being used, plotted, to facilitate direct comparison of the unknown and standard spectra.
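The difference in direction between a forward and a reverse search can be shown with a deliberately simple Python sketch. Neither function below is the KB or the PBM algorithm, both of which use peak intensities and considerably more refined scoring; the functions `forward_score` and `reverse_score`, the scoring by peak presence alone, and the toy spectra are assumptions made purely to show which spectrum is being asked to account for which.

```python
def forward_score(unknown, reference):
    """Share of the unknown's peaks that also appear in the reference."""
    return len(unknown.keys() & reference.keys()) / len(unknown)


def reverse_score(unknown, reference):
    """Share of the reference's peaks that also appear in the unknown."""
    return len(unknown.keys() & reference.keys()) / len(reference)


# Hypothetical spectra: m/z -> relative intensity.
reference = {73: 25, 207: 95, 224: 100}
# An unknown that contains the reference compound plus a second component.
mixture = {43: 60, 57: 100, 71: 40, 73: 20, 207: 50, 224: 55}

fwd = forward_score(mixture, reference)
rev = reverse_score(mixture, reference)
print(fwd, rev)
```

In this toy case the forward score is pulled down by the peaks of the second component, while the reverse score is not, since every reference peak is present in the unknown.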

Before leaving the area of mass spectral search systems, one should note that today virtually every mass spectrometer which runs electron impact (EI) spectra has both a search program and a database provided as part of the system package. The search programs are usually variations of the Biemann and McLafferty PBM search routines. The database is usually the NBS database, although not usually the latest version. The reason for the database not being the latest version is twofold. Firstly, manufacturers do not update a system disk that often. Secondly, and more critically, not many disk systems installed on old, and even new, computer systems have sufficient disk capacity for the entire library of over 40,000 spectra. Even when the entire library is installed on a manufacturer's data system, one soon discovers that some of the original database is missing. In particular, the information normally left out, owing to space limitations on the disk, comprises many (if not all) of the chemical names and synonyms, and details of the source of the spectrum. Thus going to an online system for complete details may be necessary. (As for incomplete data, it is useful to mention that the six volume set of books published by the US Government Printing Office does not have spectral source information, such as is shown in Figure 3.) An example of a typical plot of a spectrum is given in Figure 3, while a sample page from the six volume set of books is given in Figure 4. In Figure 3, the spectrum of 2,3,6 trichloro benzoic acid (the result of the search in Figure 1) is plotted out on an expanded scale.

Recent Activities in Library Searching

One important aspect of library searching which is attracting continued attention is that of how to analyze and search for compounds found in mixtures. The PBM method, mentioned before, is one good approach, although it has problems when some components are found in large amounts and others are found in only much smaller or trace amounts. PBM does best when the chemicals in the mixture are present in roughly equal proportions, which is not always the case in real-life problems, such as dump sites and polluted waters. The most recent of the McLafferty papers (stretching over a decade) in search of fine tuning the ultimate search program is one which deals with further improvements in the statistical reliability of predicted matches [13]. The result of this latest work indicates that they are able to provide a quantitative measure of the predicted reliability of a given spectral match. In addition, work was presented which improved the procedures for taking into account the variation in peak abundances caused by mass discrimination and the change in sample concentration often found during GC runs.

A recent article by scientists at an EPA lab presents a system of computer programs for recognizing impure or mixed spectra and automatically subtracting the reference mass spectrum of a chemical in the mixture from the spectrum of the mixture [14]. This spectrum subtraction would have considerable use in enhancing the ability of computer library search programs to match components of a multicomponent mixture correctly, given the problems of current programs, such as the PBM system mentioned above. In addition, a set of quality factors was used to help evaluate the overall validity of the spectrum library match.
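The idea of subtracting a reference spectrum from a mixture spectrum can be sketched in Python as follows. This is not the program described in [14]: the function name `subtract_reference`, the scaling rule (the largest multiple of the reference that leaves no peak negative), and the toy spectra are all assumptions chosen to keep the example small. The reference spectrum is assumed to be non-empty.

```python
def subtract_reference(mixture, reference):
    """Remove a scaled reference spectrum from a mixture spectrum.

    Both arguments map m/z to intensity. The scale is the largest multiple
    of the reference that leaves no negative residual (an assumed rule).
    Returns the scale and the residual spectrum.
    """
    peaks = [mz for mz, inten in reference.items() if inten > 0]
    scale = min(mixture.get(mz, 0) / reference[mz] for mz in peaks)
    residual = {}
    for mz, inten in mixture.items():
        left = inten - scale * reference.get(mz, 0)
        if left > 1e-9:                  # drop peaks fully explained by the reference
            residual[mz] = left
    return scale, residual


# Hypothetical spectra: m/z -> intensity.
reference = {73: 25, 207: 95, 224: 100}
mixture = {43: 60, 57: 100, 73: 20, 207: 47.5, 224: 50}

scale, residual = subtract_reference(mixture, reference)
print(scale, residual)
```

The residual spectrum, here dominated by m/z 43 and 57, is what would then be submitted to the library search to look for the remaining component.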

Lastly, a study by a research group at Boston University has proposed a method to evaluate library searching systems. The evaluation procedure is called Quantitative Evaluation of Library Searching (QELS) [15]. The method compares hit-lists obtained with trial conditions (e.g., compressed spectra) to hit-lists from a successful search system. While this approach has been used for infrared (IR) library searching, it should be valid for mass spectral library searching, and it would be of considerable use to the practicing spectroscopist if such an evaluation method were available, particularly one developed by a group which has no vested interest in existing search methods.

SUMMARY

It is hoped that the reader now has sufficient background to understand the nature and content of the mass spectral databases which are now available, whether on a mass spectrometry data system, a magnetic tape of spectral data, or an online system. The most important point to take from this chapter is that the mass spectral databases are small (60,000 spectra out of over 7,000,000 reported chemicals is well under 1% of known chemicals) and not of the highest quality. However, what you have read about here is what you can get, so it is best to learn to work with it. A critical point, which all scientists should remember but most often forget, is that structure elucidation is not founded upon one technique. Mass spectral data are very valuable, but not absolute and not unique. Other confirming evidence, whether it be chemical or spectral (e.g., IR, CNMR, and so forth), is absolutely necessary for good science. The reason there continues to be further work on library search systems, to fine tune them and squeeze out the last drop of information, is simple. Mass spectral data alone are not enough, but some still try to make them so. Good scientists use all the tools that are available to solve a problem. In most cases this means more than mass spectrometry.

REFERENCES

1. S. R. Heller, Anal. Chem., 44, 1951 (1972); G. W. A. Milne, S. R. Heller, R. S. Heller, and D. P. Martinsen, Adv. Mass Spectrom., 8B, 1578 (1980); S. R. Heller, Kemia-Kemi, #1, pages 15-16 (1984).

2. The NIH/EPA/MSDC database is available for lease on computer tape from the U.S. National Bureau of Standards (NBS), Office of Standard Reference Data, Physics Building, Room A-320, Gaithersburg, Maryland 20899, USA (Telephone 301-921-2228). The database is also available in printed form (currently six volumes and an index volume). The six volumes are available from the US Government Printing Office, Washington, DC 20402. The MS books are available as the first four volume set (stock number 003-003-01987-9), Supplement Number 1 (stock number 003-003-02268-3), and Supplement Number 2 (stock number 003-003-02514-3). For prices and details on how to order, please contact the Government Printing Office.

3. The Wiley/NBS Mass Spectral Database is available from John Wiley & Sons, Electronic Publishing Division, 605 Third Avenue, New York, NY 10158.

4. Mass Spectrometry Data Centre, UKCIS, The University, Nottingham NG7 2RD, UK.

5. J. G. Dillard, S. R. Heller, F. W. McLafferty, G. W. A. Milne and R. Venkataraghavan, Org. Mass Spectrom., 16, 48-49 (1981).

6. G. W. A. Milne, W. L. Budde, S. R. Heller, D. P. Martinsen and R. G. Oldham, Org. Mass Spectrom., 17, 547-552 (1982).

7. D. P. Martinsen, Appl. Spectrosc., 35, 255 (1981).

8. (Finnigan MAT) SPECTRA, Volume 10, Number 1, 1984.

9. S. R. Heller, J. Info. Processing and Management, 27, 19 (1984); S. R. Heller, Drexel Library Quarterly, 18, #3 & 4, 39 (1982); G. W. A. Milne, R. Potenzone Jr., and S. R. Heller, Science, 215, 371 (1982).

10. H. S. Hertz, R. A. Hites, and K. Biemann, Anal. 681 (1971).

11. F. W. McLafferty, R. H. Hertel, and R. D. Villwock, Spectrom., 9, 690 (1974).

12. G. M. Pesyna, R. Venkataraghavan, H. E. Dayringer, and F. W. McLafferty, Anal. Chem., 51, 1945 (1979).

13. B. L. Atwater, D. B. Stauffer, F. W. McLafferty, and D. W. Peterson, Anal. Chem., 57, 899 (1985).

14. W. M. Shackelford and D. M. Cline, Anal. Chim. Acta, 164, 251 (1985).

15. J. R. Hallowell and M. Delaney, Trends Anal. Chem., 4, #3, IV-VII, (1985).

TITLES FOR TABLE AND FIGURES

Table 1: Summary of EI Mass Spectral Databases

Figure 1: Typical MSSS PEAK search.

Figure 2: Typical Search using the Biemann Search Procedure.

Figure 3: Plot of Compound identified in PEAK search in Figure 1.

Figure 4: Sample page from the six volume set of mass spectral books.