THE MSDC/EPA/NIH MASS SPECTRAL SEARCH SYSTEM
S. R. Heller
Environmental Protection Agency 401 M Street S.W. (PM-218) Washington, DC 20460, USA
A large file of unique and high quality mass spectral data has been assembled in a collaborative effort involving the U.S. Environmental Protection Agency (EPA), the U.S. National Institutes of Health (NIH) and the U.K. Mass Spectrometry Data Centre (MSDC). This file of 29,936 spectra, together with the programs for searching through it, has been made available to the international scientific community via a time-sharing computer network. This "Mass Spectral Search System" (MSSS) has been in operation for almost five years, during which time an estimated 50,000 searches have been completed. Currently there are about 200 working accounts which are using the system on a daily basis.
The last few years have seen the gradual development, primarily at NIH and EPA, of a Chemical Information System (CIS) on an interactive time-sharing DEC PDP-10, which contains several collections of data such as mass spectra, carbon nuclear magnetic resonance spectra and x-ray diffraction data. Of all these components, the MSSS is the most highly developed and has been operating on a commercial basis for three years. It is available on a fee-for-service basis via the ADP-Network Services Inc., Cyphernetics Division computer network, which can be reached by a local telephone call throughout most of North America and Western Europe and via Telex throughout the world.
This paper describes the MSSS and provides a report on the status of the system.
3. Methods of Searching
As a practical matter, it is preferable that the time necessary to search through a data base be largely independent of the size of the data base. In the case of MSSS, this is accomplished using an inverted file technique with a data base of "abbreviated" mass spectra. These are spectra in which only the two most intense peaks in consecutive gaps of 14 atomic mass units (amu) are retained. The resulting file is only about 30 percent the size of the full file but, as has been shown by Biemann and his group at MIT, contains essentially all the information that was in the unabbreviated spectra. Although the abbreviated file is used for efficient searching, the full file is also stored in the computer in order that users may retrieve complete mass spectra to confirm identifications.
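The abbreviation step described above can be sketched roughly as follows. This is only an illustration: the function name, the dictionary representation of a spectrum, and the assumption that the 14 amu windows start at mass 0 are not taken from the MSSS programs, and the peak values are made up.

```python
from collections import defaultdict


def abbreviate(spectrum, window=14, keep=2):
    """Keep the `keep` most intense peaks in each consecutive `window`-amu interval.

    `spectrum` maps integer m/e values to intensities. Windows are assumed
    to start at mass 0 (0-13, 14-27, ...); this alignment is an assumption.
    """
    by_window = defaultdict(list)
    for mass, intensity in spectrum.items():
        by_window[mass // window].append((intensity, mass))

    abbreviated = {}
    for peaks in by_window.values():
        # Highest intensity first; ties are broken in favour of the higher mass.
        for intensity, mass in sorted(peaks, reverse=True)[:keep]:
            abbreviated[mass] = intensity
    return dict(sorted(abbreviated.items()))


# Made-up spectrum: m/e 43, 44 and 55 share one window, so the weakest (44) is dropped.
example = {41: 30, 43: 100, 44: 5, 55: 12, 57: 80, 58: 4, 71: 45, 85: 20}
result = abbreviate(example)
print(result)
```

Because each window contributes at most two peaks, the size of an abbreviated spectrum is bounded by its mass range rather than by the number of peaks recorded.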
Searching through this mass spectral data base can be accomplished in a variety of ways. These programs are summarized in table 1. Probably the most important of these methods is the "PEAK" search. This program permits one to identify all the mass spectra in the file that contain a specific peak (m/e value) with an intensity that falls into a given range. The data base can be searched with a second peak and the two lists of hits are then automatically intersected to produce a list of spectra that contain both peaks. Intensities are not precisely reproducible in mass spectral measurements and so a range of acceptable intensities must be created for the purposes of a search. This can be done by the user, who can specify upper and lower values for acceptable intensities. Alternatively, if he specifies only one value, the program takes this value and accepts any peak whose intensity is within ± 30 percent of it.
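A minimal sketch of a PEAK search over an inverted file is given here, assuming a simple in-memory index. The function names, the data layout and the three tiny spectra are hypothetical and are not the MSSS implementation. The point of the inverted file is that a query touches only the entries for the requested m/e value, not the whole data base.

```python
from collections import defaultdict


def build_inverted_file(spectra):
    """Map each m/e value to a list of (spectrum_id, intensity) pairs."""
    index = defaultdict(list)
    for spec_id, peaks in spectra.items():
        for mass, intensity in peaks.items():
            index[mass].append((spec_id, intensity))
    return index


def peak_search(index, mass, low, high=None, tolerance=0.30):
    """Return IDs of spectra with a peak at `mass` in the accepted intensity range.

    If only one intensity is given, the window is that value +/- 30 percent,
    as described in the text.
    """
    if high is None:
        low, high = low * (1 - tolerance), low * (1 + tolerance)
    return {sid for sid, inten in index.get(mass, []) if low <= inten <= high}


# Made-up abbreviated spectra, for illustration only.
spectra = {
    "A": {43: 100, 57: 80, 71: 45},
    "B": {43: 100, 58: 40},
    "C": {43: 20, 57: 75, 71: 10},
}
index = build_inverted_file(spectra)

first = peak_search(index, 43, 100)      # single value: accepts 70 to 130
second = peak_search(index, 57, 60, 90)  # explicit lower and upper limits
hits = first & second                    # automatic intersection of the two hit lists
print(sorted(hits))
```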
Table 1. Mass spectral search system (MSSS): current and future options
1. Peak and intensity search
2. Loss and intensity search
3. Molecular weight search
4. Code search
5. Molecular formula search
(b) partial, stripped
6. Peak and loss search
7. Peak and molecular weight search
8. Peak and molecular formula search
9. Peak and code search
10. Loss and molecular weight search
11. Loss and molecular formula search
12. Loss and code search
13. Molecular weight and code search
14. Molecular weight and molecular formula search
15. Complete spectrum search
16. Dissimilarity Comparison
17. Spectrum/Source Print-out
18. Spectrum/source display
19. Spectrum/source plotting
20. Spectrum/source microfiche
21. Crab-comments and complaints
22. Entering new data
(a) mini-computer interface
(b) data collection sheets
23. News-news of the MSSS
24. MSDC Bulletin-literature search
25. CAS Registry Data
26. SSS-substructure search of CAS data
28. Molecular formula from isotope pattern
29. Molecular weight from spectral data
A different program called LOSS can be used to identify all the spectra in the file that exhibit the loss of a given neutral mass from the molecular ion. Such a search is of limited utility alone but it can be used in conjunction with the PEAK search. This is an "and" type of search known as PEAK AND LOSS. As might be expected, it is a very powerful means of narrowing a search down rapidly to a few candidate spectra.
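The combination of the two searches can be illustrated with a short sketch. It assumes that each record stores a nominal molecular weight and that a loss of n mass units appears as a peak at (molecular weight minus n). The record layout, function names and numerical values are made up for illustration, and for brevity the sketch scans the records directly instead of using an inverted file.

```python
def loss_search(records, neutral_loss, min_intensity=1):
    """IDs of spectra showing a peak at (molecular weight - neutral_loss)."""
    hits = set()
    for spec_id, rec in records.items():
        fragment = rec["mol_weight"] - neutral_loss
        if rec["peaks"].get(fragment, 0) >= min_intensity:
            hits.add(spec_id)
    return hits


def peak_search(records, mass, low, high):
    """IDs of spectra with a peak at `mass` whose intensity lies in [low, high]."""
    return {
        spec_id
        for spec_id, rec in records.items()
        if mass in rec["peaks"] and low <= rec["peaks"][mass] <= high
    }


# Hypothetical records; the values are not taken from the MSSS file.
records = {
    "X": {"mol_weight": 120, "peaks": {120: 30, 105: 100, 77: 60}},
    "Y": {"mol_weight": 134, "peaks": {134: 25, 119: 100, 91: 40}},
    "Z": {"mol_weight": 92, "peaks": {92: 70, 91: 100, 65: 15}},
}

lost_15 = loss_search(records, 15)          # loss of 15 mass units
has_91 = peak_search(records, 91, 20, 100)  # peak at m/e 91, intensity 20 to 100
peak_and_loss = has_91 & lost_15            # the "and" of the two hit lists
print(sorted(peak_and_loss))
```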
Other means by which the data base can be searched are given in table 1 and include molecular weight, partial or complete molecular formula or code search. This last method enables one to find all entries in the file of compounds that possess a particular functional group. The codes are a series of arbitrary multi-digit codes that are used to define functional groups and in some cases, compound type.
Many binary "AND" combinations of these simple searches can be invoked and such combinations are generally found to act as much more powerful filters than a simple search. As the data base increases in size, such methods of searching become much more advantageous and with the file at its present size (29,936 spectra), we find that simple searches take considerably more of an operator's time than a combined search such as PEAK AND MOLECULAR FORMULA.
In contrast to the "interactive" method of search through the mass spectral data base, there are programs (Biemann, PBM and STIRS) that will compare an unknown spectrum sequentially to every spectrum in the file. These programs retain the best fits, ranked in order of goodness of fit. Such techniques have the advantage of being operator independent: no decision is necessary as to which peaks to enter, since all the peaks are used. The disadvantages of these methods are that they are relatively extravagant of computer time and that they require the complete mass spectrum to be entered into the system. If this has to be typed in, it constitutes a rather discouraging preliminary.
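The general shape of a sequential search can be sketched as below. The goodness-of-fit measure used here is a plain cosine similarity, chosen only as a stand-in; it is not the Biemann, PBM or STIRS algorithm, and the function names and library spectra are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra treated as intensity vectors."""
    dot = sum(a[m] * b[m] for m in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def sequential_search(unknown, library, top=3):
    """Compare the unknown with every library spectrum and keep the best fits."""
    scored = (
        (cosine_similarity(unknown, peaks), spec_id)
        for spec_id, peaks in library.items()
    )
    return heapq.nlargest(top, scored)  # ranked in order of goodness of fit


# Made-up library and unknown, for illustration only.
library = {
    "A": {43: 100, 57: 80, 71: 45},
    "B": {43: 100, 58: 40},
    "C": {77: 100, 105: 60},
}
unknown = {43: 100, 57: 80, 71: 45}

best = sequential_search(unknown, library, top=2)
for score, spec_id in best:
    print(spec_id, round(score, 3))
```

Every library spectrum is visited once per query, so the cost grows with the size of the file. This is the computer-time disadvantage noted above.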
The first of these problems has been countered by the development of a program which collects and holds the spectra of unknowns. It then puts them through the searching procedure during off-peak hours, when the machine charges are considerably lower. The results of this search are available by 8:00 a.m. on the following day, which is not inconvenient for many workers. The second problem, the entering of data into the search, has been overcome by the development of an interface that permits the user to couple his own mass spectrometer-minicomputer combination directly to the network computer. Mass spectral data flow from the mass spectrometer through the minicomputer and interface to MSSS, the search is carried out, and the answers are relayed back to the user. At present, this type of interface can be purchased to operate with the Varian, SI-150, Hewlett-Packard cassette and disk, INCOS and Finnigan 6000 data systems. Other manufacturers are in the process of developing the necessary software for their computer systems.
4. Data Retrieval
The remaining programs in MSSS deal with the partial or complete retrieval of specified mass spectra from the data base. A file of complete mass spectra is available in the computer and one may, upon completion of a search and identification of the appropriate ID #, use this number to obtain a printout of the full spectrum or part of it. If one is using a terminal that is capable of plotting (such as the Tektronix 4000 series, the DEC GT40 series, Zeta Plotter, H-P Plotter, etc.) then a spectrum may be plotted as a bar graph. Whether the data are printed or plotted, the origin of the spectrum is also given, as are the experimental conditions under which it was measured.
During the five years that MSSS has been used on a regular basis, it has come to be of particular value in some well identified contexts in laboratories in the academic, industrial and government sectors. In general, the reasons for which MSSS is used are not recorded, but examples in which the system proves to be especially helpful often become more widely known by a number of mechanisms. That MSSS has been featured in the plot of a science fiction novel ("The Swarm" by Arthur Herzog) can be regarded as a form of recognition, dubious though it may be!
An early example of the value of the MSSS that was fairly well publicized involved the treatment of a six-year-old child admitted to a Denver hospital. The child had ingested some of the contents of an unlabeled bottle of liquid and was developing symptoms of serious intoxication. Mass spectrometry of the material by the local Denver EPA lab and application of MSSS revealed the toxic principle to be parathion. Confirmation of this was obtained by comparison with an authentic sample and a vigorous course of treatment was commenced, all within an hour. In retrospect, there seems little doubt that the fortunate outcome of the episode was due, at least in part, to the MSSS.
The public is unexpectedly and seriously exposed to chemicals in a host of ways, such as oil spillage, train and truck accidents and the release of industrial waste into the environment. Rapid and accurate identification of compounds, which are often only present at low levels, is essential in the decision as to whether the chemicals pose a danger to the health of the community or the environment.
The MSSS is used in a very routine way by analysts of the EPA and was involved in the recent, widely publicized identification of halogenated organic compounds in the water supplies of several cities, most notably New Orleans. The analysis of the compounds in question was carried out with a gas chromatograph coupled to a mass spectrometer; the resulting mass spectra were examined with the help of the MSSS, and well over sixty distinct compounds were identified in this way. Final confirmation of these identifications was in every case arrived at by a direct comparison of the experimentally obtained spectrum with the file spectrum of the appropriate compound.
Identification of drugs and poisons is a task that is frequently undertaken for reasons that range from the purely forensic to the purely medical. The sensitivity of gas chromatography-mass spectrometry (GC-MS) and the power of MSSS make them valuable tools in post mortem examinations. These methods are also used in local, state and federal law enforcement laboratories in attempts to identify materials that have been seized. Adulteration of heroin with inactive "fillers" such as glucose is a very common practice. The identity of the filler compounds in a batch of illicit heroin can be determined by GC-MS and MSSS and often gives a key as to the origin of the heroin. For this reason, police laboratories in several states have adopted GC-MS as a standard technique and are also using the MSSS as a means of identifying such compounds.
Throughout the development of the MSSS, there has been a continual effort to create a computer system that is easy to use and that permits the chemist to bring into play his own expertise without requiring that he be particularly competent in the use of computers.
To this end, a comprehensive User's Guide, now being edited for its fifth edition, is made available to all MSSS users and there are, within the MSSS, many so-called "HELP" files which a confused user can consult and, it is hoped, use to solve his immediate problem. Users can and do write complaints and/or observations regarding the MSSS, which are handled by a professional who is employed for that purpose.
In terms of simplicity for the user, however, the most important element is clearly the program itself. At any point in a session, the options available to the user are sufficient, but no more than sufficient. In each case, the computer prompts are concise and unambiguous and as a result, it is not uncommon to observe a new user learning to use the MSSS by just exploring the programs and succeeding in carrying out searches.
At the same time, we find that more sophisticated methods of searching are used rather little. The more direct searches do give the correct answers, albeit less efficiently, and users are satisfied by this. The response to this, which is still evolving, is to keep gentle but continual economic pressure on users to search in the most efficient way possible. As an example, the judicious selection of ions that are used in a PEAK search usually means that only three or four ions should be necessary to complete a search. If one uses very commonly occurring ions, however, such as m/e 43, 57, 71, and so on, then more entries will be needed to finish a search. Users are encouraged to select peaks carefully by a pricing scheme under which the cost of a PEAK search is $3.00 until more than five peaks are used, at which point the price rises to $7.00.
The economic basis of the MSSS is still evolving, but enough experience has been gained to permit some observations to be made. The expense involved in assembling a data base such as is used in MSSS is very considerable. The spectra are in general, extant, but a surprisingly large amount of manpower is necessary to locate, check, copy and assemble them into a usable data base. A second important economic consideration is the cost of storing a large data base in a computer. This cost is directly proportional to the size of the data base.
Both these points, taken together, suggest that it is more economical to keep one copy of the data base on disk, and make it accessible to many users via a computer network, as has been done in the case of MSSS. Multiple copies of the file will create formidable problems in updating and will, of course, imply multiple monthly computer charges for storage.
In the MSSS, the day-to-day expenses for file maintenance are borne by the MSDC, which attempts to recover some of these costs from users. The latter pay a subscription fee for admission to the system and in addition to this, they are also required to pay the computer service company (Cyphernetics) in proportion to the extent of their use of the system. The subscription fee is $300 per organization per year and the usage fee is fixed, but dependent upon which option is used. A simple PEAK search, for example, costs $3.00 (unless more than five peaks are entered) and other searches cost from $1.00 to $6.00; the combined searches are generally the more expensive. As a consequence of program design, search times and costs are almost independent of the size of the data base. The sequential search discussed above normally costs $6.00 but can be run overnight, in which case the cost is $2.00. Thoughtless use of the searching programs will increase one's costs without improving the quality of results. As an example, a search for all the spectra with an ion at m/e 43 and intensity between 10 percent and 90 percent will generate the useless result that some 5,000 spectra satisfy these criteria.
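The fee structure quoted above lends itself to a small worked example. The sketch below uses only the figures given in the text (the $300 subscription, the $3.00 and $7.00 PEAK search prices, and the $6.00 and $2.00 sequential search prices). The function names and the usage pattern are hypothetical, and the other search options, which the text prices only as a range, are left out.

```python
SUBSCRIPTION_PER_YEAR = 300.00  # per organization, from the text


def peak_search_cost(n_peaks):
    """$3.00 for up to five peaks, $7.00 once more than five are used."""
    if n_peaks < 1:
        raise ValueError("a PEAK search needs at least one peak")
    return 3.00 if n_peaks <= 5 else 7.00


def sequential_search_cost(overnight=False):
    """$6.00 normally, $2.00 when run overnight."""
    return 2.00 if overnight else 6.00


def annual_cost(peak_counts, daytime_sequential=0, overnight_sequential=0):
    """Subscription plus usage fees for one year of searching."""
    usage = sum(peak_search_cost(n) for n in peak_counts)
    usage += daytime_sequential * sequential_search_cost(overnight=False)
    usage += overnight_sequential * sequential_search_cost(overnight=True)
    return SUBSCRIPTION_PER_YEAR + usage


# Hypothetical year: 100 three-peak searches, 10 six-peak searches,
# and 20 sequential searches run overnight.
total = annual_cost([3] * 100 + [6] * 10, overnight_sequential=20)
print(f"${total:.2f}")
```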
7. Use of Subsets of the Data
It is a fact that as the data base becomes larger, it is more difficult to extract from it a small number of spectra using a given amount of data. Thus, while a PEAK search using two peaks may have given only one answer when there were 10,000 spectra in the file, the same two peaks might now cause the retrieval of five spectra, and entry of a third peak may well be necessary to narrow down the choices sufficiently. It also seems clear that it is inefficient to attempt to identify compounds of low molecular weight using a file that contains data from many compounds of much higher molecular weight. This is because the latter spectra very frequently will contain peaks possessed by the former. For this reason, we are presently considering the possibility of sub-dividing the file in the future. Currently, an intersected search has the effect of sub-dividing the file or creating a small sub-file that is then searched. For example, a PEAK AND CODE search can be used to isolate all the chlorinated compounds in the main file by invoking the proper code and then search through them for specific spectral features. The same effect would be achieved if all the chlorinated compounds were in a single file separate from the main data base. This subfile would only be searched upon a specific command. The advantage of this would be that the main file would not be "cluttered" with steroid spectra. The obvious disadvantage is that this approach implies a prejudice of sorts on all searches of the main data base. It is also not clear what groups of compounds would qualify for inclusion into a sub-file. Pesticides would seem to be a fairly clear-cut group of compounds that, in general, are of interest to relatively few people. An unambiguous definition of "pesticide" is not available, and even if it were, the segregation of pesticides from the main file will present a daily question for toxicologists, whose encounters with pesticides are rare, but not unknown.
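The way an intersected search behaves like a search of a temporary sub-file can be sketched as follows. The code number, the record layout and the function names are invented for illustration; the actual MSSS codes are not given in this paper.

```python
CHLORINATED = 1234  # hypothetical functional-group code, not an actual MSSS code


def code_search(records, code):
    """IDs of all entries carrying the given functional-group code."""
    return {sid for sid, rec in records.items() if code in rec["codes"]}


def peak_search_within(records, candidates, mass, low, high):
    """PEAK search restricted to a previously selected set of entries."""
    return {
        sid
        for sid in candidates
        if mass in records[sid]["peaks"] and low <= records[sid]["peaks"][mass] <= high
    }


# Made-up records, for illustration only.
records = {
    "P": {"codes": {1234}, "peaks": {35: 20, 111: 100}},
    "Q": {"codes": {1234, 5678}, "peaks": {75: 100, 111: 15}},
    "R": {"codes": {5678}, "peaks": {111: 90}},
}

subfile = code_search(records, CHLORINATED)                  # acts as a temporary sub-file
hits = peak_search_within(records, subfile, 111, 50, 100)    # PEAK AND CODE
print(sorted(subfile), sorted(hits))
```

Entry "R" has a matching peak but is never examined, because it lies outside the sub-file. This is the "prejudice" that a permanent sub-division would build into every search.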
A good case can be made for sub-files where proprietary information is involved. In this case, a sub-file of such data could be assembled and used by those authorized to do so. Other users of MSSS need not even be aware of the existence of such sub-files.
8. Present Progress
An aspect of the MSSS that is currently receiving considerable attention relates to the quality of the data in the file. Work is in progress to check some parts of this such as molecular formulas and molecular weights. It is more difficult to check the quality of the mass spectra themselves and there is also a problem that is related to the question of redundant spectra. The whole file has now been checked by the Chemical Abstracts Service, which has assigned CAS Registry Numbers to each compound in the file. These numbers were then used to find the repeated appearance of the same compound. This done, the multiple spectra were examined using a "Quality Index" computer program developed for EPA by McLafferty and his group at Cornell. On the basis of the presence or absence of various phenomena, a "Quality Index" (QI) was assigned to each spectrum, and this index was used to identify the inferior redundant spectra in the file. They were then removed to produce the file of 29,936 unique spectra.
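The removal of redundant spectra can be sketched as a grouping by registry number followed by selection of the spectrum with the best index. The sketch assumes that a higher QI means a better spectrum and that the QI has already been computed. The field names, the placeholder registry numbers and the QI values are hypothetical, and the calculation of the index itself is not shown.

```python
def remove_redundant(records):
    """Keep, for each registry number, only the spectrum with the highest QI."""
    best = {}
    for rec in records:
        current = best.get(rec["cas_rn"])
        # Assumption: a larger quality index marks the better spectrum.
        if current is None or rec["quality_index"] > current["quality_index"]:
            best[rec["cas_rn"]] = rec
    return list(best.values())


# Made-up entries; "RN-1" and "RN-2" are placeholders, not real registry numbers.
records = [
    {"id": 1, "cas_rn": "RN-1", "quality_index": 0.62},
    {"id": 2, "cas_rn": "RN-1", "quality_index": 0.91},
    {"id": 3, "cas_rn": "RN-2", "quality_index": 0.75},
]

unique = remove_redundant(records)
print([rec["id"] for rec in unique])
```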
Another item that will be merged into the file within a matter of months is the Wiswesser Line Notation for the structures. Programs have been written and are under test that will permit an examination of the file from a structural point of view. As an example, one might wish to identify all compounds in the file that contain a pyrrole ring and have some particular mass spectral characteristics. A program that calculates the best molecular weight from the mass spectrum has been written by Dromey at Stanford and will be made available as a component of MSSS in the near future. A program that can analyze intensity values and calculate isotope incorporations has recently been added to the pilot version of MSSS and should become generally available soon. Finally, searching of the mass spectral literature via the Mass Spectrometry Bulletin from 1966 to 1975 is now possible.
The purpose of this paper has been to describe the capabilities of the MSSS. The system is now in a relatively stable form on the ADP-Cyphernetics computer network and inquiries regarding its use are invited. We would also be pleased to learn of new sources of mass spectral data and will be happy to acquire and process such data.
Details on obtaining an account with the Cyphernetics network can be obtained from The Manager, Data Base Services, Cyphernetics, 175 Jackson Plaza, Ann Arbor, Michigan 48106, telephone: 313-769-6800, or from The Manager, Cyphernetics International, J. C. van Markenlaan 3, Postbus 286, Rijswijk (Z.H.), The Netherlands, telephone: 070-94-88-66.
The author would like to thank Professor K. Biemann of MIT for providing the data base that was originally used in the development of the MSSS. The author would also like to thank all the colleagues who have assisted greatly in the development of the MSSS, in particular the following: G. W. A. Milne, A. Bridy, W. Budde, H. H. Fales, R. J. Feldmann, R. S. Heller, T. L. Isenhour, D. Maxwell, A. McCormick, J. McGuire, F. W. McLafferty, M. Springer, V. Vinton, S. Woodward, and M. Yagyda.