AN ON-LINE SEARCH SYSTEM FOR THE MASS SPECTROMETRY LITERATURE

V. A. VINTON and G. W. A. MILNE

Laboratory of Chemistry, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Md., 20014 (U.S.A.)

S. R. HELLER*

Environmental Protection Agency, Washington, D. C., 20460 (U.S.A.)

(Received 21st March 1977)

SUMMARY

Computer programs have been written for interactive, conversational searching through the accumulated files of the 'Mass Spectrometry Bulletin' published by the Mass Spectrometry Data Centre (MSDC). This family of programs is available for use as a part of the NIH/EPA Chemical Information System on a commercial computer network.

The enormous recent increase in the number of groups working on mass spectrometry has led to a variety of problems, amongst which are the difficulties caused by a rapidly expanding literature. Prior to 1960, the annual number of papers dealing with mass spectrometry was small and fairly constant, reflecting the output of the relatively few scientists interested. Then there was a rapid growth of activity, as mass spectrometers that could be used in analytical organic chemistry became available. Further advances have increased this rate of growth so that the number of papers on mass spectrometry is now doubling in less than ten years. In view of this activity, the farsighted decision was made by the MSDC in 1965 to examine the literature on a day-to-day basis for papers dealing with mass spectrometry. Citations and abstracts were collected and published on a monthly basis as the Mass Spectrometry Bulletin [1], which is now a generally recognized authority on the literature of mass spectrometry.

As with any printed data base, the Bulletin becomes more difficult to use as it gets larger. Apart from the physical size of the accumulated material, indexing becomes more crucial and cumulative indexes are needed. Simultaneously, the value of computer methods for handling these problems has become clearer. Since 1970, the U.K. and U.S. Governments have collaborated in the development of the MSDC-NIH-EPA Mass Spectral Search System (MSSS) [2]. As an extension of this work, a study of the use of the computer readable files of the Bulletin as the basis of an on-line, interactive literature search system has been undertaken. This system is the subject of this paper.

Data base

The Bulletin currently publishes some 6 500 citations per year. The complete data base now requires about 20 million characters of storage space. By the end of 1975 there were a total of 56 567 citations, 139 392 author entries involving 49 309 authors, and 42 632 element entries. There are 272 741 entries in the subject index and 46 735 in the general index, and 76 520 compounds are identified by compound class.

SEARCH PROGRAMS

In 1970, Hertz et al. [3] described a computer search system that used the files of the Mass Spectrometry Bulletin. This system was designed to use tape copies of the data base and so overcome the difficulties associated with a search system operating on a central large computer. The programs were run on an IBM 1802 and permitted searches for papers according to author, subject or element. Alternatively, searches could be made for compound type, elemental composition or molecular weight. Intersected searches, e.g. for all indoles with a given molecular weight, could be effected. These programs used all the parameters that had been used in the indexing of the Bulletin and so offered an alternative means of using the information in the Bulletin. They also allowed, in principle, an individual to conduct the search without recourse to a distant computer. The single disadvantage of the system is that, because the data base is stored on magnetic tape, the search is relatively slow; a typical search requires about 5 min, the limiting step being the transfer of data from the tape.

Since 1971, the use of disk, as opposed to tape, for the storage of large, searchable data bases has been examined. The cost of disk storage for a given data base, while considerably more than the cost of tape storage for the same data, is currently decreasing very rapidly, and now the cost of storing even a large data base on disk can be manageable. The great advantage of disk over tape is the vastly greater speed with which the information can be located and retrieved. This rapid retrieval permits construction of a truly interactive search system, in which the user, having asked a question, need wait only moments to receive an answer which may be sufficient or may permit restatement of the question.

The NIH-EPA Chemical Information System (CIS), a collection of numerical and literature data bases together with the necessary search programs [4], uses disk-stored data bases exclusively. Much of the CIS runs on a commercial networked computer which provides on-line access to data--a feature which demands that the data be stored on disk. Each data base is organized into two or three separate disk files, as has been described for the MSSS [5]. The major one is the so-called reference file which contains all the data necessary for a search, organized in a specific way. The other file is a pointer file, which is the file that is first reached by the search program. From the pointer file, the program finds where the information needed begins in the reference file and where it ends. The program then reads that part of the reference file in preparing the answer to the user's query.

As an example, if the user asks for all papers by Atkins to be located, the program goes to the appropriate pointer file, which is simply an alphabetically ordered file of the names of all authors whose papers are in the data base. Once the name 'Atkins' has been found in the pointer file, it is simple to read off starting and ending addresses of 'Atkins' entries in the reference file, and from these two numbers, the number of 'Atkins' entries can be calculated and reported back to the user.
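The pointer file/reference file lookup described above can be illustrated with a minimal sketch. The record layout, the sample names and citation numbers, the half-open address convention and the function name `lookup` are all assumptions made for illustration; the actual file formats are not given here.

```python
from bisect import bisect_left

# Hypothetical reference file: citation numbers stored contiguously,
# grouped by author in the same order as the pointer file.
REFERENCE_FILE = [101, 102, 250, 317, 318, 319, 404, 512, 640]

# Hypothetical pointer file: alphabetically ordered (name, start, end)
# records; start and end are assumed to be half-open offsets into
# REFERENCE_FILE.
POINTER_FILE = [
    ("ABERTH", 0, 2),
    ("ANBAR", 2, 3),
    ("ATKINS", 3, 6),
    ("BIEMANN", 6, 9),
]


def lookup(name):
    """Return (number of entries, entries) for one author name."""
    keys = [record[0] for record in POINTER_FILE]
    key = name.strip().upper()
    # Binary search is possible because the pointer file is sorted.
    i = bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return 0, []
    _, start, end = POINTER_FILE[i]
    # The count follows from the two addresses alone; the reference
    # file is read only when the entries themselves are wanted.
    return end - start, REFERENCE_FILE[start:end]


count, citations = lookup("Atkins")
print(count, citations)
```

Because the count is derived from the two addresses, the number of hits can be reported to the user before any part of the reference file is read.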

With this general technique, the data base can be searched in a variety of ways. Every paper, when it is taken from the literature, is assigned a certain number of subject keywords from a list that currently has 327 entries. In addition, any element that is studied specifically is noted as are the names of all the authors and the MSDC Compound Classification Codes of compounds described. The MSDC codes are a group of 87 four-digit codes that represent functional groups or compound types appropriate to the compounds discussed. In addition to these descriptors, there is also a 'general index' of terms which do not logically belong in the other indexes. This index includes, for example, the names of microorganisms that have been studied and the venues of meetings dealing with mass spectrometry.

These five descriptors, subject, element, author, compound class and general index term, form the basis of the indexes to the Bulletin; they can readily be stripped from the data base as it is written on the magnetic tape and they have therefore been used as the parameters with which the computer search can operate.

RESULTS

In the subject search, the user is asked to enter the subject of interest. To counter ambiguities in phraseology and spelling, the system searches through the data base for occurrence of the first three letters provided by the user, so if the word 'electron' is entered, entries to subject codes such as electron, electric and electronic will be retrieved. Each matching subject code is then presented to the user as a temporary answer file, and the user is asked to select one or more of these files. The program then accepts another subject word or, if so requested, lists the references that have already been retrieved. Before any references are listed, however, the user is asked whether or not references should be listed irrespective of the year of their appearance in the Bulletin. An affirmative answer results in all the references being printed, but a negative answer requires the user to specify the years for which citations should be retrieved. Once the years in question (e.g. 73, 74, 75) are defined, only references from those years will be listed.

An example of such a search is given in Table 1. Here the user was searching for any papers dealing with mass measurement in connection with chemical ionization mass spectrometry. The first subject word entered was 'mass measurement' and the subjects that were encountered were mass discrimination (file 1, 194 references), mass fragmentation (266 references), mass measurement (463), mass spectra (11 422), spark-source mass spectrometry (817), theory of mass spectra (375), and time-resolved mass spectrometry (357)--all subject phrases containing the word 'mass'. The user is then asked which of the seven answer files are of interest and responds by specifying file 3. The other files are then discarded and the user is prompted for a second subject word. This is 'chemical ionization', which prompts a search which leads to answer files for 'chemical binding energy', 'chemical ionization', 'chemical reactions', and 'radiation chemistry'. Once the user selects the second of these, chemical ionization, these 678 references are automatically combined in a Boolean AND operation with the 463 papers previously retrieved as relevant to mass measurement. The result is an answer file containing the 8 papers that discuss mass measurement and chemical-ionization mass spectrometry. These can be listed for any or all Bulletin years, or a third subject term can be introduced and the search continued. This complete session, from log-in to log-off, takes less than 30 s and the DEC PDP-10 cpu time required is less than 5 s.
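The three-letter matching and the optional restriction by year can be sketched as follows. This is only an illustration: the matching rule (any word of a subject phrase beginning with the first three letters of the query) is an assumption inferred from the examples above, the subject list is a small subset of the codes named in the text, and the function names are hypothetical.

```python
import re

# A few of the subject codes named in the text; the real list is longer.
SUBJECT_CODES = [
    "chemical binding energy",
    "chemical ionization",
    "chemical reactions",
    "mass discrimination",
    "mass measurement",
    "mass spectra",
    "radiation chemistry",
    "spark-source mass spectrometry",
]


def candidate_subjects(query, codes=SUBJECT_CODES, n=3):
    """Return subject codes having a word that starts with query[:n]."""
    stem = query.strip().lower()[:n]
    hits = []
    for code in codes:
        # Split on spaces and hyphens so that 'spark-source mass ...'
        # is matched through its word 'mass'.
        words = re.split(r"[\s\-]+", code.lower())
        if any(word.startswith(stem) for word in words):
            hits.append(code)
    return hits


def filter_by_year(citations, years=None):
    """Keep (year, citation_id) pairs from the given years, or all if None."""
    if years is None:
        return list(citations)
    wanted = set(years)
    return [c for c in citations if c[0] in wanted]


print(candidate_subjects("mass measurement"))
print(candidate_subjects("chemical ionization"))
```

Under this assumed rule, 'chemical ionization' also retrieves 'radiation chemistry', as in the example of Table 1, because 'chemistry' begins with the same three letters.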

There are currently a total of 327 specific subject code words in the data base. These range from the relatively non-specific (e.g. impact, electric and so on), to quite precise terms such as scatter-electrons or quadrupole.

While the example in Table 1 illustrates the use of a Boolean AND combination of two separate queries, the use of a NOT combination is also possible. If the user specifies a subject word, followed by the operator NOT, then the program will produce all the citations that do not deal with that particular subject.

In the preparation of the Bulletin, the abstracts of papers that deal with specific elements are so coded. Elements that are merely part of a molecule under study (e.g. C, H, O, N, etc.) are not usually noted, but a paper dealing with the secondary ion emission from copper surfaces, for example, would be considered as dealing with copper, and copper would be one of the keywords. The element search simply uses these element keywords to retrieve all the papers that deal with specific elements. In using this program, one has simply to specify the atomic symbol of the element in question and the program, using the pointer file/reference file technique discussed earlier, responds by telling the user how many citations are retrieved. The user is then free to list these (limiting the listing to certain years if so desired) or enter another element and continue to narrow down the number of citations. As in the subject search, AND or NOT combinations are possible. Thus a search for papers dealing with iron produced 483 references (Table 2). Only 146 papers, however, deal with both iron and cobalt, and if those papers that also deal with nickel are excluded, 24 references remain. This program is of particular value to those interested in inorganic chemistry, metallurgy and surface phenomena, and demonstrates the breadth of coverage of the Bulletin beyond the boundaries of organic mass spectrometry.
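The AND and NOT combinations used in the subject and element searches amount to set operations on answer files. The sketch below assumes that an answer file can be represented as a set of citation numbers; the citation numbers are invented, and the function name `combine` is hypothetical rather than taken from the actual programs.

```python
def combine(current, new, op="AND"):
    """Combine the current answer file with a newly retrieved one."""
    if op == "AND":
        # Keep only citations present in both files.
        return current & new
    if op == "NOT":
        # Drop citations that also appear in the new file.
        return current - new
    raise ValueError("operator must be 'AND' or 'NOT'")


# Invented citation numbers, much smaller than the files in Table 2.
iron = {1, 2, 3, 4, 5, 6}
cobalt = {2, 3, 4, 9}
nickel = {3, 10}

iron_and_cobalt = combine(iron, cobalt, "AND")
without_nickel = combine(iron_and_cobalt, nickel, "NOT")
print(len(iron), len(iron_and_cobalt), len(without_nickel))
```

Each step can only shrink or preserve the answer file, which is consistent with the successive narrowing from 483 to 146 to 24 references reported for iron, cobalt and nickel.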

A third type of search is for papers published by a specific author or group of authors. A particular problem here is that people's surnames are, in effect, trivial names and are not spelled predictably. Cursory examination of the Bulletin file shows that at some step of the publication and abstracting process, a given author's name may be changed. The author search developed for this system attempts to deal with this problem by prompting the user for the name of the author whose papers are the object of the search. The actual search, however, is carried out with the first four letters of the surname. All the papers by authors whose surnames begin with these four letters will be retrieved and the user must then decide which are appropriate. Table 3 gives an example of an author search in which the goal was to locate any papers published by Aberth and Anbar of Stanford Research Institute. When the first name, Aberth, is entered, a list of nine authors whose names begin with 'aber' is produced. From this list, it can be seen that the same name may be spelled in more than one way and people's initials are far from constant. In this case, it seems reasonable to guess that W. Aberth and W. H. Aberth are the same person, and so index numbers 8 and 9 are selected, resulting in a file of 36 papers by one or other of those authors. When the next author, Anbar, is specified, the result is simpler because he is unique in having 'anba' as the first four letters of his name. When the two files--Aberth papers and Anbar papers--are intersected, 10 papers by Anbar and either W. Aberth or W. H. Aberth are present. The observation that W. Aberth and W. H. Aberth both publish with Anbar, but never together, strongly suggests that they must be the same person.

As in the searches discussed above, NOT logic is available in the author search. For example, one could, by using the operator NOT after Anbar's name, retrieve all the papers that the Aberths have published without Anbar as co-author.

The use of the first four letters of an author's name as described here has the obvious advantage that the user need not spell the full name accurately and, significantly, need not know the correct initials. Even if the user knows the correct initials, the search need not be accurate because, as the example in Table 3 shows, scientists do not use their own initials consistently. The major disadvantage of the four-letter search is that an entry such as 'smit' produces a total of 167 candidates, including 159 Smiths. Use of a five-letter search would not solve this problem, and it may prove necessary to permit the user to supply authors' initials to alleviate this difficulty. The only inconvenience of the search as it stands, however, is the time it takes to list all the candidates; this listing does not affect the computer costs seriously.
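The four-letter author search, together with the optional use of initials suggested above, might be sketched as follows. This is not the program as implemented: the author list is an invented sample (only the two forms of Aberth's name come from the text), and filtering on the first initial alone is just one possible way of tolerating inconsistent initials.

```python
# Invented sample of (surname, initials) entries from an author index.
AUTHORS = [
    ("ABERNETHY", "J. L."),
    ("ABERTH", "W."),
    ("ABERTH", "W. H."),
    ("ANBAR", "M."),
    ("SMITH", "A. B."),
    ("SMITH", "D. H."),
    ("SMITHERS", "R. M."),
]


def author_candidates(surname, first_initial=None, authors=AUTHORS, n=4):
    """List (index number, author) pairs matching the first n letters."""
    stem = surname.strip().upper()[:n]
    hits = [
        (i, author)
        for i, author in enumerate(authors, start=1)
        if author[0].startswith(stem)
    ]
    if first_initial:
        # Hypothetical refinement: compare only the first initial, so
        # that 'W.' and 'W. H.' are both kept for a query of 'W'.
        letter = first_initial.strip().upper()[0]
        hits = [(i, a) for i, a in hits if a[1].upper().startswith(letter)]
    return hits


print(author_candidates("Aberth"))
print(author_candidates("Smith", "D"))
```

Matching on the first initial only shortens the candidate list for common surnames while still returning both forms of a name such as W. Aberth and W. H. Aberth.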

The published version of the Bulletin contains a 'General Index', which in earlier issues was termed the 'Materials Index'. This index contains entries which cannot sensibly be placed in other indexes and yet are useful for information retrieval. Such entries include types of ions (e.g. Ar(+)) and compounds (e.g. metal chelates), biological classifications (e.g. Klebsiella) and even the venues (Berlin, Dallas) of mass spectrometry meetings. As a result, this General Index can occasionally be extraordinarily useful as a means of locating a specific piece of information.

The program 'Index' can be used to search through the accumulated general indexes and operates in the same way as the subject search described above. An example of the use of this program is given in Table 4. In the first case, the user was able very rapidly to locate the one paper dealing with carbon dioxide that was presented at the 1975 NATO Advanced Study Institute held in Biarritz.

This program, like the author search, uses the first four letters of the index term supplied by the user. This rarely produces more than a few possible answers; problems such as that arising from names like 'Smith' apparently do not occur in this case. The General Index is not static and can grow more or less rapidly as more papers are abstracted. This raises the possibility that newly introduced terms cannot be used immediately for searching. Such terms will be inaccessible until the next annual update of the searchable data base.

Dissemination of the system

The Mass Spectrometry Search System (MSBULL) is available as an extension of the Mass Spectral Search System. This system is maintained by MSDC on the ADP-Cyphernetics international computer network. In addition to the searching programs, the system, like MSSS, contains user assistance files that can be accessed by the user who needs help with the conversational style of the system. The user can also list the currently used subject codewords or the MSDC compound classification codes.

We acknowledge the generous cooperation received from Dr. A. McCormick and Mr. D. C. Maxwell of the Mass Spectrometry Data Centre, Aldermaston, England, and from Mr. A. C. Nicholas of the Department of Industry of the U.K. Government.

REFERENCES

1 Mass Spectrometry Bulletin, Mass Spectrometry Data Centre, AWRE, Aldermaston, Berks., England.

2 S. R. Heller, H. M. Fales and G. W. A. Milne, Org. Mass Spectrom., 7 (1973) 107; S. R. Heller, D. A. Koniver, H. M. Fales and G. W. A. Milne, Anal. Chem., 46 (1974) 947; S. R. Heller, R. J. Feldmann, H. M. Fales and G. W. A. Milne, J. Chem. Doc., 13 (1973) 130; G. W. A. Milne and S. R. Heller, J. Chem. Inf. Comp. Sci., (1976) in press.

3 H. S. Hertz, D. A. Evans and K. Biemann, Org. Mass Spectrom., 4 (1970) 453.

4 S. R. Heller, G. W. A. Milne and R. J. Feldmann, Science, (1977) in press.

5 S. R. Heller, Anal. Chem., 44 (1972) 1951.