The MSDC/EPA/NIH Mass Spectral Search System

By G. W. A. Milne and S. R. Heller

A large file of mass spectral data has been assembled in a collaborative effort involving the U.S. Environmental Protection Agency (EPA), the U.S. National Institutes of Health (NIH), and the U.K. Mass Spectrometry Data Centre (MSDC). This file, together with the programs for searching through it, has been made available to the international scientific community via a time-sharing computer network. This "Mass Spectral Search System" (MSSS) has been in operation for more than four years, during which time an estimated 30,000 searches have been completed. Currently, over 200 working accounts are using the system on a daily basis.

Background

The last few years have seen the gradual development primarily at NIH and EPA of a Chemical Information System (CIS) on an interactive time-sharing DEC PDP-10 which contains several collections of data such as mass spectra, carbon nuclear magnetic resonance spectra, and x-ray diffraction data. In addition to these data bases, the CIS also contains a battery of data base searching programs, literature searching programs and programs that serve to provide structure-literature links. Of all these components, the MSSS is the most highly developed and has been operating on a quasi-commercial basis for three years. It is available on a fee-for-service basis via the Cyphernetics computer network which can be reached by a local telephone call throughout most of North America and Western Europe.

The purpose of this article is to describe the MSSS and also to provide a report on the status of the system and some of our experiences with it during the last three years.

Methods of searching

As a practical matter, it is preferable that the time necessary to search through a data base be largely independent of the size of the data base. In the case of the MSSS, this is accomplished using an inverted file technique with a data base of "abbreviated" mass spectra. These are spectra in which only the two most intense peaks in consecutive gaps of 14 amu are retained. The resulting file is only about 30% the size of the full file, but, as has been shown by Biemann and his group at MIT, contains essentially all the information that was in the unabbreviated spectra. Although the abbreviated file is used for efficient searching, the full file is also stored in the computer so that users may retrieve complete mass spectra to confirm identifications. There is a certain redundancy in this practice, and we have, therefore, experimented with computer-generated microfiche of the complete mass spectra with a view to obviating the need to store the complete spectra in the computer. Acceptance of microfiche is, however, far from general, and the spectra are therefore still stored on both media.
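The abbreviation step can be illustrated with a short Python sketch. It is not the MSSS code: the function name, the dictionary representation of a spectrum, and the choice to start the 14-amu windows at m/e 0 are all assumptions made for illustration, and the example spectrum is invented.

```python
from collections import defaultdict


def abbreviate(spectrum, window=14, keep=2):
    """Keep the `keep` most intense peaks in each consecutive `window`-amu interval.

    `spectrum` maps integer m/e values to relative intensities.
    Windows are assumed to start at m/e 0 (0-13, 14-27, ...).
    """
    bins = defaultdict(list)
    for mz, intensity in spectrum.items():
        bins[mz // window].append((intensity, mz))

    abbreviated = {}
    for peaks in bins.values():
        # Strongest peaks first; ties are broken in favor of the higher m/e.
        for intensity, mz in sorted(peaks, reverse=True)[:keep]:
            abbreviated[mz] = intensity
    return dict(sorted(abbreviated.items()))


# Invented spectrum, for illustration only.
full = {41: 30, 43: 100, 55: 20, 57: 80, 58: 5, 69: 10, 71: 40, 85: 15}
short = abbreviate(full)
```

In this toy case the window 56-69 holds three peaks, so the weakest one (m/e 58) is dropped while every other peak survives.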

Searching through this mass spectral data base can be accomplished in a variety of ways. The available programs are summarized in Figure 1. Probably the most important of these methods is the PEAK search. This program permits one to identify all the mass spectra in the file that contain a specific peak (m/e value) with an intensity that falls into a given range. The data base can be searched with a second peak, and the two lists of hits are then automatically intersected to produce a list of spectra that contain both peaks. Intensities are not precisely reproducible in mass spectral measurements, and so a range of acceptable intensities must be created for the purposes of a search. This can be done by the user, who can specify upper and lower values for acceptable intensities. Alternatively, if he specifies only one value, the program takes this value and accepts any peak whose intensity is within ±30% of it.
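A PEAK search of this kind can be sketched as a lookup in an inverted file followed by a set intersection. The Python below is a minimal illustration, not the MSSS implementation: the function names, the posting-list layout, and the toy spectra are assumptions, and the single-value case is taken to mean a symmetric 30% window around the entered intensity.

```python
from collections import defaultdict


def build_inverted_file(spectra):
    """Map each m/e value to a list of (spectrum_id, intensity) postings."""
    index = defaultdict(list)
    for spec_id, peaks in spectra.items():
        for mz, intensity in peaks.items():
            index[mz].append((spec_id, intensity))
    return index


def intensity_range(low, high=None, tolerance=0.30):
    """Return (low, high); a single value is widened by the assumed tolerance."""
    if high is None:
        return low * (1 - tolerance), low * (1 + tolerance)
    return low, high


def peak_search(index, mz, low, high=None):
    """Return the IDs of spectra with a peak at `mz` inside the intensity range."""
    lo, hi = intensity_range(low, high)
    return {spec_id for spec_id, inten in index.get(mz, []) if lo <= inten <= hi}


# Invented spectra, keyed by hypothetical ID numbers.
spectra = {
    1: {177: 35, 192: 5},
    2: {177: 20, 150: 100},
    3: {177: 80, 192: 8},
}
index = build_inverted_file(spectra)

hits = peak_search(index, 177, 10, 40)    # first peak, explicit range
hits &= peak_search(index, 192, 2, 10)    # second peak, intersected with the first
single = peak_search(index, 177, 30)      # one value: accepts 21 to 39
```

Because each lookup touches only the postings for one m/e value, the work done depends on how common that peak is rather than on the total number of spectra, which is the property the inverted file is meant to provide.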

A different program called LOSS can be used to identify all the spectra in the file that exhibit the loss of a given neutral mass from the molecular ion. Such a search is of limited utility alone, but it can be used in conjunction with the PEAK search. This is an "AND" type of search known as PEAK AND LOSS. As might be expected, it is a very powerful means of narrowing a search down rapidly to a few candidate spectra.
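The idea behind LOSS and its combination with PEAK can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function names, and the sample records are hypothetical, and the molecular ion is assumed to sit at the nominal molecular weight, so a neutral loss of n amu shows up as a peak at MW - n.

```python
def loss_search(spectra, neutral_loss, min_intensity=1):
    """Return IDs of spectra with a peak at (molecular weight - neutral_loss)."""
    hits = set()
    for spec_id, record in spectra.items():
        fragment = record["mw"] - neutral_loss
        if record["peaks"].get(fragment, 0) >= min_intensity:
            hits.add(spec_id)
    return hits


def peak_search(spectra, mz, low, high):
    """Return IDs of spectra with a peak at `mz` whose intensity is in [low, high]."""
    return {
        spec_id
        for spec_id, record in spectra.items()
        if low <= record["peaks"].get(mz, 0) <= high
    }


# Invented records, keyed by hypothetical IDs.
spectra = {
    "A": {"mw": 192, "peaks": {192: 5, 177: 35, 149: 10}},
    "B": {"mw": 206, "peaks": {206: 12, 177: 30, 191: 40}},
    "C": {"mw": 192, "peaks": {192: 60, 164: 100}},
}

loss_hits = loss_search(spectra, 15)            # loss of 15 amu from the molecular ion
peak_hits = peak_search(spectra, 191, 10, 100)  # peak at m/e 191, intensity 10-100%
peak_and_loss = peak_hits & loss_hits           # the "AND" of the two hit lists
```

The combined search is simply the intersection of the two hit lists, which is why it narrows the candidates faster than either search alone.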

Other means by which the data base can be searched are given in Figure 1 and include molecular weight, partial or complete molecular formula, and MSDC code search. This last method enables one to find all entries in the file of compounds that possess a particular functional group. The MSDC codes are a series of arbitrary four-digit codes used to define functional groups and, in some cases, compound type. The code 1170, for example, applies to all aldehydes and 1670 to all carbohydrates.

Many binary AND combinations of these simple searches can be invoked, and such combinations are generally found to act as much more powerful filters than a simple search. As the data base increases in size, such methods of searching become much more advantageous, and with the file at its present size (39,509 spectra), we find that simple searches take considerably more of an operator's time than a combined search such as PEAK AND MOLECULAR FORMULA.

As an example, there are in the file 50 spectra with an ion at m/e 386 having an intensity between 33% and 100%. If, however, this search is restricted to compounds containing exactly 27 carbon atoms, only 17 spectra are found. A further example of the power of intersected searches is shown in Figure 2 in which it can be seen that a simple PEAK search with m/e 177 and intensity between 10% and 40% produces 648 spectra. Of these, 34 spectra also contain an ion with m/e 192 and intensity between 2% and 10%. If, however, the same data are used in a PMW search, in which the molecular weight is defined as 192, then instead of 648 and 34 spectra, there are retrieved only 13 and 2 spectra, respectively. It should be noted that the PMW search costs the same ($3) but gives far fewer answers for about the same investment in the user's time and is more efficient for this reason.

1. Peak and Intensity Search
2. Loss and Intensity Search
3. Molecular Weight Search
4. Code Search
5. Molecular Formula Search
(a.) Complete
(b.) Partial, Stripped
6. Peak and Loss Search
7. Peak and Molecular Weight Search
8. Peak and Molecular Formula Search
9. Peak and Code Search
10. Loss and Molecular Weight Search
11. Loss and Molecular Formula Search
12. Loss and Code Search
13. Molecular Weight and Code Search
14. Molecular Weight and Molecular Formula Search
15. Complete Spectrum Search
(a.) BIEMANN
(b.) STIRS
(c.) PBM
16. Dissimilarity Comparison
17. Spectrum/Source Print-out
18. Spectrum/Source Display
19. Spectrum/Source Plotting
20. Spectrum/Source Microfiche
21. Crab-Comments and Complaints
22. Entering New Data
(a.) Microcomputer Interface
(b.) Data Collection Sheets
23. News-News of the MSSS
24. MSDC Bulletin-Literature Search
25. CAS Registry Data
26. SSS-Substructure Search of CAS Data
27. WLN
28. Molecular Formula from Isotope Pattern
29. Molecular Weight from Spectral Data

Figure 1 Options available in the MSSS.


In contrast to the "interactive" method of searching through the mass spectral data base, there are programs (Biemann, PBM, and STIRS) that will compare an unknown spectrum sequentially to every spectrum in the file. These programs retain the best fits, ranked in order of goodness of fit. Such techniques have the advantage of being operator-independent: no decision is necessary about which peaks to enter, because all the peaks are used. Their disadvantages are: 1) they are relatively extravagant of computer time and 2) they require that the complete mass spectrum be entered into the system. The latter requirement constitutes a rather discouraging preliminary.
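The general shape of such a sequential search can be sketched in Python. The similarity measure used here, a cosine similarity between intensity vectors, is a stand-in chosen for brevity; it is not the Biemann, PBM, or STIRS algorithm, and the function names, library entries, and IDs are invented for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra given as {m/e: intensity} dicts."""
    dot = sum(inten * b.get(mz, 0) for mz, inten in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sequential_search(unknown, library, keep=3):
    """Compare `unknown` with every library spectrum; return the best fits, best first."""
    scored = (
        (cosine_similarity(unknown, peaks), spec_id)
        for spec_id, peaks in library.items()
    )
    return heapq.nlargest(keep, scored)


# Invented library and unknown, for illustration only.
library = {
    "X": {86: 100, 99: 18, 183: 2},
    "Y": {86: 100, 99: 5, 183: 1, 71: 60},
    "Z": {43: 100, 57: 80},
}
unknown = {86: 100, 99: 20, 183: 2}
best = sequential_search(unknown, library, keep=2)
```

Every library spectrum is scored, so the running time grows with the size of the file; this is the cost that the text above describes as extravagant of computer time.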

The first of these problems has been countered by the development of a program that collects and holds the spectra of unknowns. They are then put through the searching procedure during off-peak hours, when the machine charges are considerably lower. The results of this search are available at 8 AM on the following day, which is not inconvenient for many workers. As to the second of the problems, the need to enter data into the search has been overcome by the development of an interface that permits the user to couple his own mass spectrometer-minicomputer combination directly to the network computer. Mass spectral data flow from the mass spectrometer through the microcomputer and interface to the MSSS where the search is carried out, and the answers are relayed back to the user. At present, this type of interface can be purchased to operate with the Varian, SI-150, Hewlett-Packard cassette, and Finnigan 6000 data systems, and other manufacturers are in the process of developing the necessary software for their computer systems.

Data retrieval

The remaining programs in the MSSS deal with the partial or complete retrieval of specified mass spectra from the data base. A file of complete mass spectra is available in the computer, and one may, upon completion of a search and identification of the appropriate ID number, use this number to obtain a print-out of the full spectrum or part of it. If one is using a terminal that is capable of plotting (such as the Tektronix 4000 series, the DEC GT40 series, the Zeta plotter, or the H-P plotter), then a spectrum may be plotted as a bar graph. Whether the data are reprinted or plotted, the origin of the spectrum is also given, as are the experimental conditions under which it was measured.

The use of microfiche for long-term storage of complete mass spectra has been mentioned previously. The microfiche are computer-generated and may be used with the help of a program called "FICHE." This program will accept an ID number and calculate the corresponding microfiche number as well as the location of the required frame on the microfiche. These data facilitate the viewing of microfiche in a manual viewer. In a more advanced system, the microfiche are stored in a viewer that automatically retrieves a specific fiche from a carrousel and then projects a particular frame from that fiche onto a screen. This microfiche display unit, shown in Figure 3, can be driven by the computer, which needs only the spectrum ID number to produce the correct microfiche image. Such computer-driven microfiche viewers are capable of handling over 150,000 mass spectra on a single carrousel and can retrieve any specific spectrum in under 4 sec. Every microfiche has 192 frames, each containing one mass spectrum as a bar graph and also a listing of m/e value vs intensity. The 39,509 mass spectra, therefore, occupy 206 microfiche. An example of one frame of a microfiche is shown in Figure 4.
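The kind of calculation FICHE performs can be sketched in a few lines of Python. The actual mapping used by FICHE is not described here, so this sketch assumes that spectra are laid out in consecutive order, 192 frames per fiche, and that the ID number is simply the 1-based position of the spectrum in that order; the function name is hypothetical.

```python
import math

FRAMES_PER_FICHE = 192  # one spectrum per frame, as stated in the text


def fiche_location(sequence_number, frames_per_fiche=FRAMES_PER_FICHE):
    """Return (fiche_number, frame_number), both 1-based, for a 1-based position."""
    if sequence_number < 1:
        raise ValueError("sequence_number must be 1 or greater")
    fiche, frame = divmod(sequence_number - 1, frames_per_fiche)
    return fiche + 1, frame + 1


# Number of fiche needed for the whole file of 39,509 spectra.
fiche_needed = math.ceil(39509 / FRAMES_PER_FICHE)

# Location of the last spectrum under the assumed consecutive layout.
last = fiche_location(39509)
```

Under these assumptions the file fills 205 complete fiche and part of a 206th, which agrees with the figure of 206 microfiche quoted above.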


PEAK AND INTENSITY SEARCH
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID, MW, MF AND NAME
USER: 177,10,40
#REFS M/E PEAKS
648 177
NEXT REQUEST: 192,2,10
#REFS M/E PEAKS
34 177 192

MW AND PEAK SEARCH (CR TO EXIT)
USER: THE MW IS: 192
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID, MW, MF, NAMES
USER: 177,10,40
FOUND 13 REFERENCES TO THAT COMBINATION
NEXT REQUEST: 192,2,10
FOUND 2 REFERENCES TO THAT COMBINATION

Figure 2 Comparison of a simple PEAK search with a combined PEAK AND MOLECULAR WEIGHT search.






Applications

During the four years that the MSSS has been used on a regular basis, it has come to be of particular value in some well-identified contexts in laboratories in the academic, industrial, and government sectors. In general, the reasons for which the MSSS is used are not recorded, but examples in which the system proves to be especially helpful often become more widely known by a number of mechanisms. That the MSSS has been featured in the plot of a science fiction novel (The Swarm by Arthur Herzog) can be regarded as a form of recognition, dubious though it may be!

An early example of the value of the MSSS that was fairly well publicized involved the treatment of a six-year-old child admitted to a Denver hospital. The child had ingested some of the contents of an unlabeled bottle of liquid and was developing symptoms of serious intoxication. Mass spectrometry of the material and application of the MSSS revealed the toxic principle to be parathion. Confirmation of this was obtained by comparison with an authentic sample, and a vigorous course of treatment was commenced, all within an hour. In retrospect, there seems little doubt that the fortunate outcome of the episode was due, at least in part, to the MSSS.

The public is unexpectedly and seriously exposed to chemicals in a host of ways, such as oil spillage, train and truck accidents, and the release of industrial waste into the environment. Rapid and accurate identification of compounds, which are often only present at low levels, is essential in the decision about whether the chemicals pose a danger to the health of the community or the environment.

The MSSS is used routinely by analysts of the EPA and was involved in a recent, widely publicized identification of halogenated organic compounds in the water supplies of several cities, most notably New Orleans. The analysis of the compounds in question was carried out with a gas chromatograph coupled to a mass spectrometer. The resulting mass spectra were examined with the help of the MSSS, and well over 60 distinct compounds were identified in this way. Final confirmation of these identifications was in every case arrived at by a direct comparison of the experimentally obtained spectrum with the file spectrum of the appropriate compound.

Identification of drugs and poisons is a task that is frequently undertaken for reasons that range from the purely forensic to the purely medical. The sensitivity of gas chromatography-mass spectrometry (GC-MS) and the power of the MSSS make them valuable tools in post mortem examinations. These methods are also used in local, state, and federal law enforcement laboratories in attempts to identify materials that have been seized. Adulteration of heroin with inactive "fillers" such as glucose is a very common practice. The identity of the filler compounds in a batch of illicit heroin can be determined by GC-MS and the MSSS and often gives a key to the origin of the heroin. For this reason, police laboratories in several states have adopted GC-MS as a standard technique and are also using the MSSS as a means of identifying such compounds.

In the case of drug overdoses, the identity of the drug or drugs involved is a factor in the rational choice of therapy. Such information is often unavailable from the patient and so is now obtained in a number of clinical laboratories by GC-MS analysis of an extract of the patient's urine, serum, or gastric contents and subsequent identification of individual compounds by the MSSS. An example of such a case is given in Figure 5. The gastric contents of an adult male, who was admitted comatose to a Washington, D.C., hospital, were shown by GC-MS to have one major component. The mass spectrum of this compound had three prominent ions at m/e 86, 99, and 183. When these ions were used in a PEAK search as shown in Figure 5, only two spectra, ID numbers 16746 and 35743, were retrieved as possessing these peaks. The first of the two spectra, 16746, proved to be essentially identical to that of the unknown material, while the other, 35743, although it had ions at m/e 86, 99, and 183, was otherwise quite different from that of the unknown. With this evidence, the drug involved in the overdose was tentatively identified as the tranquilizer Dalmane. With this information in hand, it was clear that aggressive treatment was unnecessary, so the patient was simply kept under observation until he regained consciousness, and he was discharged a day later.

PEAK AND INTENSITY SEARCH
TYPE PEAK, INT
CR TO EXIT, 1 FOR ID, MW, MF AND NAME
USER: 86,100,100
# REFS M/E PEAKS
131 86
NEXT REQUEST: 99,3,30
# REFS M/E PEAKS
21 86 99
NEXT REQUEST: 183,0,3
# REFS M/E PEAKS
2 86 99 183
NEXT REQUEST: 1
ID# MW MF NAME
16746 387 C21.H23.CL.F.N3.O FLURAZEPAM (DALMANE*)
35743 309 C19.H35.N.O2 DICYCLOMINE

Figure 5 A PEAK search for the major component in a drug overdose case.

Throughout the development of the MSSS, there has been a continual effort to create a computer system that is easy to use and that permits the chemist to bring into play his or her own expertise without requiring that he or she be particularly competent in the use of computers.

To this end, a comprehensive User's Guide, now in its fourth edition, is made available to all MSSS users, and there are, with the MSSS, many so-called HELP files which a confused user can consult and, it is hoped, use to solve his immediate problem. Users can and do write complaints and/or observations regarding the MSSS which are handled by a professional who is employed for that purpose. The clearance rate on such "CRABS" is close to 100%.

In terms of simplicity for the user, however, the most important element is clearly the program itself. At any point in a session, the options available to the user are sufficient, but no more than sufficient. In each case, the computer prompts are concise and unambiguous, and as a result, it is not uncommon to observe a new user learning to use the MSSS without having to resort to the User's Guide, but just exploring the programs and succeeding in carrying out searches.

At the same time, we find that more sophisticated methods of searching are used rather little. The more direct searches do give the correct answers, albeit less efficiently, and users are satisfied by this. The result is a gentle but continual economic pressure on users to search in the most efficient way possible. As an example, the judicious selection of ions that are used in a PEAK search usually means that only three or four ions should be necessary to complete a search. If one uses very commonly occurring ions, however, such as m/e 43, 57, or 71, then more entries will be needed to finish a search. Users are encouraged to select peaks with a little thought by a pricing scheme which ensures that the cost of a PEAK search is $3 until more than five peaks are used, at which point the price rises to $7.

Economics

The economic basis of the MSSS is still evolving, but enough experience has been gained to permit some observations to be made. The expense involved in assembling a data base such as that used in the MSSS is considerable. The spectra are, in general, extant, but a surprisingly large amount of manpower is necessary to locate, check, copy, and assemble them into a useable data base. A second important economic consideration is the cost of storing a large data base in a computer. This cost is directly proportional to the size of the data base, and a typical cost on the commercial market for a collection of 39,000 mass spectra is on the order of $3,000 per month.

Both these points, taken together, suggest that it is more economical to keep one copy of the data base on disk and make it accessible to many users via a computer network, as has been done in the case of the MSSS. Multiple copies of the file will create formidable problems in updating and will, of course, imply multiple monthly computer charges for storage.

In the MSSS, the day-to-day expenses for file maintenance are borne by the MSDC, which attempts to recover some of these costs from users. The latter pay a subscription fee for admission to the system, and in addition to this, they are also required to pay the computer service company (Cyphernetics Corporation) in proportion to the extent of their use of the system. The subscription fee is $300 per organization per year ($400 for the first year), and the usage fee is fixed, but it depends upon which option is used. A simple PEAK search, for example, costs $3 (unless more than five peaks are entered), and other searches cost from $1 to $6; the combined searches are generally the more expensive. As a consequence of program design, search times and costs are almost independent of the size of the data base. The sequential search discussed previously normally costs $6 but can be run overnight, in which case the cost is $2. Thoughtless use of the searching programs will increase one's costs without improving the quality of results. As an example, a search for all spectra with an ion at m/e 43 and intensity between 10% and 90% will generate the useless result that some 6,000 spectra satisfy these criteria.
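The fee structure quoted above lends itself to a small worked calculation. The Python sketch below uses only the prices stated in the text ($3 or $7 for a PEAK search, $6 or $2 for a sequential search, and a subscription of $300, or $400 in the first year); the function names and the example usage pattern are hypothetical, and any charges not mentioned in the text are ignored.

```python
def peak_search_cost(n_peaks):
    """$3 for a PEAK search of up to five peaks, $7 once more than five are entered."""
    return 3 if n_peaks <= 5 else 7


def sequential_search_cost(overnight=False):
    """$6 for an immediate sequential search, $2 if it is run overnight."""
    return 2 if overnight else 6


def annual_cost(peak_counts, n_sequential, first_year=False, overnight=True):
    """Subscription plus usage fees for one organization over one year.

    `peak_counts` lists the number of peaks entered in each PEAK search.
    """
    subscription = 400 if first_year else 300
    usage = sum(peak_search_cost(n) for n in peak_counts)
    usage += n_sequential * sequential_search_cost(overnight)
    return subscription + usage


# Hypothetical first-year usage: three PEAK searches and two overnight sequential searches.
total = annual_cost([3, 4, 6], n_sequential=2, first_year=True)
```

In this example the one PEAK search that used six peaks costs more than the other two combined, which is the economic pressure toward careful peak selection described earlier.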

Uses of subsets of the data

It is a fact that as the data base becomes larger, it is more difficult to extract from it a small number of spectra using a given amount of data. Thus while a PEAK search using two peaks may have given only one answer when there were 10,000 spectra in the file, the same two peaks might now cause the retrieval of five spectra, and entry of a third peak may well be necessary to narrow down the choices sufficiently. It also seems clear that it is inefficient to attempt to identify compounds of low molecular weight using a file that contains data from many compounds of much higher molecular weight. This is because the latter spectra very frequently will contain the peaks possessed by the former. For this reason, we are presently considering the possibility of subdividing the file in the future. Currently, an intersected search has the effect of subdividing the file or creating a small subfile that is then searched. For example, a PEAK AND CODE search can be used to isolate all the steroids in the main file by invoking the MSDC code 1710 and then search through them for specific spectral features. The same effect would be achieved if all the steroids were in a single file separate from the main data base. This subfile would only be searched upon a specific command. The advantage of this would be that the main file would not be "cluttered" with steroid spectra, and the obvious disadvantage is that this approach implies a prejudice of sorts on all searches of the main data base. It is also not clear what groups of compounds would qualify for inclusion into a subfile. Pesticides would seem to be a fairly clear-cut group of compounds that, in general, are of interest to relatively few people. An unambiguous definition of "pesticide" is not available, however, and even if it were, the segregation of pesticides from the main file will present a daily question for toxicologists, for example, whose encounters with pesticides are rare but not unknown.

A good case can be made for subfiles where proprietary information is involved. In this case, a subfile of such data could be assembled and used by those authorized to do so. Other users of the MSSS need not even be aware of the existence of such subfiles.

Future prospects

An aspect of the MSSS that is currently receiving considerable attention relates to the quality of the data in the file. Work is in progress to check some parts of this, such as molecular formulas and molecular weights. It is more difficult to check the quality of the mass spectra themselves, and there is also a problem that is related to the question of redundant spectra. The whole file has now been checked by the Chemical Abstracts Service, which has assigned CAS registry numbers to each compound in the file. These numbers can be used to find the repeated appearance of the same compound. With this done, the multiple spectra will all be examined using a computer program developed for EPA by McLafferty and his group at Cornell. This program examines the spectra for defined errors such as the presence of intense ions at m/e values higher than the molecular weight plus 2, losses of unusual neutral masses such as 5 or 7 amu, and so on. On the basis of the presence or absence of such phenomena, a quality index (QI) can be assigned to the spectrum, and this index can be used to identify the inferior redundant spectra in the file, so that they can be removed. Other experiments are being conducted with a view to examining the feasibility of averaging different spectra of the same compound and so arriving at a single representative spectrum that can be retained while the other spectra are removed from the file.
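The two error checks named above can be sketched as simple tests on a spectrum. This Python fragment is not the Cornell program and does not compute its quality index: the function names, the 10% threshold for an "intense" ion, the restriction of the neutral-loss test to losses from the molecular ion, and the rule of preferring the spectrum with the fewest flags are all assumptions made for illustration, as are the two sample spectra.

```python
def quality_flags(peaks, mw, intense=10, odd_losses=(5, 7)):
    """Return a list of suspicious features found in one spectrum.

    `peaks` maps m/e to relative intensity; `mw` is the nominal molecular weight.
    """
    flags = []
    # Check 1: an intense ion well above the molecular ion region.
    if any(mz > mw + 2 and inten >= intense for mz, inten in peaks.items()):
        flags.append("intense ion above MW + 2")
    # Check 2: apparent loss of an unusual neutral mass from the molecular ion.
    for loss in odd_losses:
        if peaks.get(mw - loss, 0) > 0:
            flags.append(f"loss of {loss} amu from molecular ion")
    return flags


def least_flagged(candidates, mw):
    """Among redundant spectra of one compound, return the ID with the fewest flags."""
    return min(candidates, key=lambda spec_id: len(quality_flags(candidates[spec_id], mw)))


# Two invented spectra assumed to belong to the same compound (MW 192).
candidates = {
    "good": {192: 5, 177: 35},
    "bad": {192: 5, 187: 20, 200: 50},
}
keep = least_flagged(candidates, mw=192)
```

Grouping the redundant spectra in the first place would rely on the CAS registry numbers mentioned above; that step is omitted here.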

Work in all of these areas is now well advanced, and it is hoped that a greatly improved data base will be available this year. Other items that will be merged into the file within a matter of months include revised MSDC codes, registry numbers, and Wiswesser line notations for the structures. Programs have been written and are under test that will permit an examination of the file from a structural point of view. As an example, one might wish to identify all compounds in the file that contain a pyrrole ring and have some particular mass spectral characteristics. A program that calculates the best molecular weight from the mass spectrum has been written by Dromey at Stanford and will be made available as a component of the MSSS in the near future. A program that can analyze intensity values and calculate isotope incorporations has recently been added to the pilot version of the MSSS and should become generally available soon. Finally, searching of the mass spectral literature via the Mass Spectrometry Bulletin is now possible and should also be made available this year.

Summary

The purpose of this article has been to describe the capabilities of the MSSS as well as its background and the plans for its future development. The system is now in a relatively stable form on the Cyphernetics computer network. We would also be pleased to learn of new sources of mass spectral data and will be happy to acquire and process such data.






Dr. Milne is with the National Heart and Lung Institute, National Institutes of Health, and Dr. Heller is Computer Specialist, Environmental Protection Agency. The authors would like to thank K. Biemann of MIT for providing the data base that was originally used in the development of the MSSS. They would also like to thank all of their colleagues who have assisted greatly in the development of MSSS. In particular, they would like to thank the following: A. Bridy, W. Budde, H. M. Fales, R. J. Feldmann, R. S. Heller, T. L. Isenhour, D. Maxwell, A. McCormick, J. McGuire, F. W. McLafferty, M. Springer, V. Vinton, S. Woodward, and M. Yaguda.