AN ON-LINE SEARCH SYSTEM FOR THE MASS SPECTROMETRY LITERATURE
V. A. VINTON and G. W. A. MILNE
Laboratory of Chemistry, National Heart, Lung and Blood
Institute, National Institutes of Health, Bethesda, Md., 20014
(U.S.A.)
S. R. HELLER*
Environmental Protection Agency, Washington, D. C., 20460
(U.S.A.)
(Received 21st March 1977)
SUMMARY
Computer programs have been written for interactive,
conversational searching through the accumulated files of the
'Mass Spectrometry Bulletin' published by the Mass Spectrometry
Data Centre (MSDC). This family of programs is available for use
as a part of the NIH/EPA Chemical Information System on a
commercial computer network.
The enormous recent increase in the number of groups working on
mass spectrometry has led to a variety of problems, amongst which
are the difficulties caused by a rapidly expanding literature.
Prior to 1960, the annual number of papers dealing with mass
spectrometry was small and fairly constant, reflecting the output
of the relatively few scientists interested. Then there was a
rapid growth of activity, as mass spectrometers that could be
used in analytical organic chemistry became available. Further
advances have increased this rate of growth so that the number of
papers on mass spectrometry is now doubling in less than ten
years. In view of this activity, the farsighted decision was made
by the MSDC in 1965 to examine the literature on a day-to-day
basis for papers dealing with mass spectrometry. Citations and
abstracts were collected and published on a monthly basis as the
Mass Spectrometry Bulletin [1], which is now a generally
recognized authority on the literature of mass spectrometry.
As with any printed data base, the Bulletin becomes more
difficult to use as it gets larger. Apart from the physical size
of the accumulated material, indexing becomes more crucial and
cumulative indexes are needed. Simultaneously, the value of
computer methods for handling these problems has become clearer.
Since 1970, the U.K. and U.S. Governments have collaborated in
the development of the MSDC-NIH-EPA Mass Spectral Search System
(MSSS) [2]. As an extension of this work, a study of the use of
the computer readable files of the Bulletin as the basis of an
on-line, interactive literature search system has been
undertaken. This system is the subject of this paper.
Data base
The Bulletin currently publishes some 6,500 citations per year.
The complete data base now requires about 20 million characters
of storage space. By the end of 1975 there were a total of 56 567
citations, 139 392 author entries involving 49 309 authors, and
42 632 element entries. There are 272 741 entries in the subject
index, 46 735 in the general index, and 76 520 compounds are
identified by compound class.
SEARCH PROGRAMS
In 1970, Hertz et al. [3] described a computer search system that
used the files of the Mass Spectrometry Bulletin. This system was
designed to use tape copies of the data base and so overcome the
difficulties associated with a search system operating on a
central large computer. The programs were run on an IBM 1802 and
permitted searches for papers according to author, subject or
element. Alternatively, searches could be made for compound type,
elemental composition or molecular weight. Intersected searches,
e.g. for all indoles with a given molecular weight, could be
effected. These programs used all the parameters that had been
used in the indexing of the Bulletin and so offered an
alternative means of using the information in the Bulletin. They
also allowed, in principle, an individual to conduct the search
without recourse to a distant computer. The single disadvantage
of the system is that because the search is of a data base that
is stored on magnetic tape, it is relatively slow; a typical
search requires about 5 min, the limiting step being the transfer
of data from the tape.
Since 1971, the use of disk, as opposed to tape, for the storage
of large, searchable data bases has been examined. The cost of
disk storage for a given data base, while considerably more than
the cost of tape storage for the same data, is currently
decreasing very rapidly, and now the cost of storing even a large
data base on disk can be manageable. The great advantage of disk
over tape is the vastly greater speed with which the information
can be located and retrieved. This rapid retrieval permits
construction of a truly interactive search system, in which the
user, having asked a question, need wait only moments to receive
an answer which may be sufficient or may permit restatement of
the question.
The NIH-EPA Chemical Information System (CIS), a collection of
numerical and literature data bases together with the necessary
search programs [4], uses disk-stored data bases exclusively.
Much of the CIS runs on a commercial networked computer which
provides on-line access to data, a feature which demands that the
data be stored on disk. Each data base is organized into two or
three separate disk files, as has been described for the MSSS
[5]. The major one is the so-called reference file which contains
all the data necessary for a search, organized in a specific way.
The other file is a pointer file, which is the file that is first
reached by the search program. From the pointer file, the program
finds where the information needed begins in the reference file
and where it ends. The program then reads that part of the
reference file in preparing the answer to the user's query.
As an example, if the user asks for all papers by Atkins to be
located, the program goes to the appropriate pointer file, which
is simply an alphabetically ordered file of the names of all
authors whose papers are in the data base. Once the name 'Atkins'
has been found in the pointer file, it is simple to read off
starting and ending addresses of 'Atkins' entries in the
reference file, and from these two numbers, the number of
'Atkins' entries can be calculated and reported back to the user.
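The pointer file/reference file lookup can be illustrated with a minimal sketch. It assumes in-memory lists in place of the two disk files; the record layout, the names `pointer_file`, `reference_file` and `find_entries`, and all the citation numbers are hypothetical and are not taken from the CIS programs.

```python
from bisect import bisect_left

# Hypothetical stand-in for the reference file: citation numbers grouped
# by author, in the same order as the pointer file.
reference_file = [101, 245, 377, 12, 58, 903, 904, 31]

# Hypothetical stand-in for the pointer file: alphabetically ordered
# (name, start, end) records; start and end are inclusive positions
# in reference_file.
pointer_file = [
    ("ABERTH", 0, 2),
    ("ANBAR", 3, 3),
    ("ATKINS", 4, 6),
    ("BIEMANN", 7, 7),
]


def find_entries(name):
    """Return (count, citations) for an author, going through the pointer file."""
    keys = [record[0] for record in pointer_file]
    wanted = name.upper()
    i = bisect_left(keys, wanted)      # binary search of the ordered names
    if i == len(keys) or keys[i] != wanted:
        return 0, []
    _, start, end = pointer_file[i]
    count = end - start + 1            # computed from the two addresses alone
    return count, reference_file[start:end + 1]


count, citations = find_entries("Atkins")
print(count, citations)
```

The count is available from the pointer record before any part of the reference file is read, which is what allows the number of entries to be reported to the user at once.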
With this general technique, the data base can be searched in a
variety of ways. Every paper, when it is taken from the
literature, is assigned a certain number of subject keywords from
a list that currently has 327 entries. In addition, any element
that is studied specifically is noted as are the names of all the
authors and the MSDC Compound Classification Codes of compounds
described. The MSDC codes are a group of 87 four-digit codes that
represent functional groups or compound types appropriate to the
compounds discussed. In addition to these descriptors, there is
also a 'general index' of terms which do not logically belong in
the other indexes. This index includes, for example, the names of
microorganisms that have been studied and the venues of meetings
dealing with mass spectrometry.
These five descriptors, subject, element, author, compound class
and general index term, form the basis of the indexes to the
Bulletin; they can readily be stripped from the data base as it
is written on the magnetic tape and they have therefore been used
as the parameters with which the computer search can operate.
RESULTS
In the subject search, the user is asked to enter the subject of
interest. To counter ambiguities in phraseology and spelling, the
system searches through the data base for occurrence of the first
three letters provided by the user, so if the word 'electron' is
entered, entries to subject codes such as electron, electric and
electronic will be retrieved. These are then presented to the
user who is asked to select one or more of these temporary files.
The program then accepts another subject word or, if so
requested, lists the references that have already been retrieved.
Before any references are listed, however, the user is asked
whether or not references should be listed irrespective of the
year of their appearance in the Bulletin. An affirmative answer
results in all the references being printed, but a negative
answer requires the user to specify the years for which citations
should be retrieved. Once the years in question (e.g. 73, 74,
75) are defined, only references from those years will be listed.
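The truncated-key matching and the optional restriction by year might be sketched as follows. This is an illustration only: the dictionary `subject_index`, its subject phrases, the citation numbers and the function names are all assumed for the example and do not reproduce the actual files or programs.

```python
# Hypothetical subject index: subject phrase -> list of
# (citation number, two-digit Bulletin year).
subject_index = {
    "electric fields": [(1, 72), (2, 75)],
    "electron impact": [(3, 73), (4, 74), (5, 76)],
    "electronic structure": [(6, 75)],
    "ion sources": [(7, 74)],
}


def candidate_subjects(word, key_length=3):
    """List subject phrases in which the first letters of `word` occur."""
    key = word.lower()[:key_length]
    return sorted(subject for subject in subject_index if key in subject)


def list_references(subject, years=None):
    """Return citation numbers for a subject, optionally limited to some years."""
    return [number for number, year in subject_index[subject]
            if years is None or year in years]


# The user enters 'electron'; only 'ele' is used for matching.
print(candidate_subjects("electron"))
# The user selects one candidate and restricts the listing to 73, 74 and 75.
print(list_references("electron impact", years={73, 74, 75}))
```

Passing `years=None` corresponds to the affirmative answer, in which references are listed irrespective of year.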
An example of such a search is given in Table 1. Here the user
was searching for any papers dealing with mass measurement in
connection with chemical ionization mass spectrometry. The first
subject word entered was 'mass measurement' and the subjects that
were encountered were mass discrimination (file 1, 194
references), mass fragmentation (266 references), mass
measurement (463), mass spectra (11 422), spark-source mass
spectrometry (817), theory of mass spectra (375), and
time-resolved mass spectrometry (357), all subject phrases
containing the word 'mass'. The user is then asked which of the
seven answer files are of interest and responds by specifying
file 3. The
other files are then discarded and the user is prompted for a
second subject word. This is 'chemical ionization', which prompts
a search which leads to answer files for 'chemical binding
energy', 'chemical ionization', 'chemical reactions', and
'radiation chemistry'. Once the user selects the second of these,
chemical ionization, these 678 references are automatically
combined in a Boolean AND operation with the 463 papers
previously retrieved as relevant to mass measurement. The result
is an answer file containing the 8 papers that discuss mass
measurement and chemical-ionization mass spectrometry. These can
be listed for any or all Bulletin years, or a third subject term
can be introduced and the search continued. This complete
session, from log-in to log-off, takes less than 30 s and the DEC
PDP-10 cpu time required is less than 5 s.
There are currently a total of 327 specific subject code words in
the data base. These range from the relatively non-specific (e.g.
impact, electric and so on), to quite precise terms such as
scatter-electrons or quadrupole.
While the example in Table 1 illustrates the use of a Boolean AND
combination of two separate queries, the use of a NOT combination
is also possible. If the user specifies a subject word, followed
by the operator NOT, then the program will produce all the
citations that do not deal with that particular subject.
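The combination of answer files can be pictured as set operations on citation numbers. The sketch below is an assumption about how such a step could be written, not the program itself: the function `combine` and the small sets are invented, and NOT is interpreted here as removing the newly retrieved citations from the running answer file.

```python
def combine(current, new, operator="AND"):
    """Combine the running answer file with a newly retrieved one."""
    if operator == "AND":
        return current & new   # citations present in both files
    if operator == "NOT":
        return current - new   # citations in the running file only
    raise ValueError(f"unsupported operator: {operator}")


# Invented citation numbers standing in for two answer files.
mass_measurement = {11, 12, 13, 14, 15}
chemical_ionization = {13, 15, 21, 22}

both = combine(mass_measurement, chemical_ionization, "AND")
without = combine(mass_measurement, chemical_ionization, "NOT")
print(sorted(both), sorted(without))
```

Because each result is itself an answer file, a third term can be combined with it in the same way.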
In the preparation of the Bulletin, the abstracts of papers that
deal with specific elements are so coded. Elements that are
merely part of a molecule under study (e.g. C, H, O, N, etc.) are
not usually noted, but a paper dealing with the secondary ion
emission from copper surfaces, for example, would be considered
as dealing with copper, and copper would be one of the keywords.
The element search simply uses these element keywords to retrieve
all the papers that deal with specific elements. In using this
program, one has simply to specify the atomic symbol of the
element in question and the program, using the pointer
file/reference file technique discussed earlier, responds by
telling the user how many citations are retrieved. The user is
then free to list these (limiting the listing to certain years if
so desired) or enter another element and continue to narrow down
the number of citations. As in the subject search, AND or NOT
combinations are possible. Thus a search for papers dealing with
iron produced 483 references (Table 2). Only 146 papers, however,
deal with both iron and cobalt, and if those papers that also
deal with nickel are excluded, 24 references remain. This program
is of particular value to those interested in inorganic
chemistry, metallurgy and surface phenomena, and demonstrates the
breadth of coverage of the Bulletin beyond the boundaries of
organic mass spectrometry.
A third type of search is for papers published by a specific
author or group of authors. A particular problem here is that
people's surnames are, in effect, trivial names and are not
spelled predictably. Cursory examination of the Bulletin file
shows that at some step of the publication and abstracting
process, a given author's name may be changed. The author search
developed for this system attempts to deal with this problem by
prompting the user for the name of the author whose papers are
the object of the search. The actual search, however, is carried
out with the first four letters of the surname. All the papers by
authors whose surnames begin with these four letters will be
retrieved and the user must then decide which are appropriate.
Table 3 gives an example of an author search in which the goal
was to locate any papers published by Aberth and Anbar of
Stanford Research Institute. When the first name, Aberth, is
entered, a list of nine authors whose names begin with 'aber' is
produced. From this list, it can be seen that the same name may
be spelled in more than one way and people's initials are far
from constant. In this case, it seems reasonable to guess that W.
Aberth and W. H. Aberth are the same person, and so index numbers
8 and 9 are selected, resulting in a file of 36 papers by one or
other of those authors. When the next author, Anbar, is
specified, the result is simpler because he is unique in having
'anba' as the first four letters of his name. When the two
files (Aberth papers and Anbar papers) are intersected, 10 papers
by Anbar and either W. Aberth or W. H. Aberth are found. The
observation that W. Aberth and W. H. Aberth both publish with
Anbar, but never with each other, strongly suggests that they are
the same person.
As in the searches discussed above, NOT logic is available in the
author search. For example, one could, by using the operator NOT
after Anbar's name, retrieve all the papers that the Aberths have
published without Anbar as co-author.
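The steps of the author search (truncation to four letters, selection from a numbered candidate list, and combination of the resulting files) can be sketched as below. Every entry in `author_index` is invented for illustration, as are the function names; the real candidate list in Table 3 has nine entries.

```python
# Hypothetical author index: name as printed -> set of citation numbers.
author_index = {
    "ABERTH W": {1, 2, 3},
    "ABERTH W H": {4, 5},
    "ABERCROMBIE R": {6},
    "ANBAR M": {2, 4, 7},
}


def candidates(surname, key_length=4):
    """Return a numbered list of authors whose names begin with the key."""
    key = surname.upper()[:key_length]
    names = sorted(name for name in author_index if name.startswith(key))
    return dict(enumerate(names, start=1))


def answer_file(choices, selected):
    """Merge the papers of the selected candidate numbers into one file."""
    papers = set()
    for number in selected:
        papers |= author_index[choices[number]]
    return papers


aber = candidates("Aberth")              # user types the full name
aberth_papers = answer_file(aber, [2, 3])  # user picks both spellings
anbar_papers = answer_file(candidates("Anbar"), [1])

print(aber)
print(sorted(aberth_papers & anbar_papers))   # papers with Anbar
print(sorted(aberth_papers - anbar_papers))   # papers without Anbar
```

Selecting several candidate numbers merges their papers before any intersection, which is how variant spellings of one name are treated as a single author.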
The use of the first four letters of an author's name as
described here has the obvious advantage that the user need not
spell the full name accurately and, significantly, need not know
the correct initials. Even if the user knows the correct
initials, a search on them might not be reliable because, as the example
in Table 3 shows, scientists do not use their own initials
consistently. The major disadvantage of the four-letter search is
that an entry such as 'smit' produces a total of 167 candidates,
including 159 Smiths. Use of a five-letter search would not solve
this problem, and it may prove necessary to permit the user to
supply the author's initials to alleviate this difficulty. The only
inconvenience of the search as it stands, however, is the time it
takes to list all the candidates; this listing does not affect
the computer costs seriously.
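One way in which user-supplied initials might shorten a long candidate list is sketched below. This is purely illustrative of the possibility raised above and is not a feature of the present system; the function `filter_by_initials` and the candidate names are hypothetical. Initials are compared as a prefix, so that an author printed both with and without a second initial is still retained.

```python
def filter_by_initials(names, initials):
    """Keep candidates whose printed initials begin with those supplied."""
    wanted = initials.upper().replace(".", "").replace(" ", "")
    kept = []
    for name in names:
        # Assumed layout: surname, then initials separated by spaces.
        _, _, rest = name.partition(" ")
        if rest.replace(" ", "").startswith(wanted):
            kept.append(name)
    return kept


# Invented candidates of the kind a search on 'smit' could return.
smit = ["SMIT J", "SMITH D H", "SMITH D", "SMITH R M", "SMITHSON L"]
print(filter_by_initials(smit, "D."))
```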
The published version of the Bulletin contains a 'General Index',
which in earlier issues was termed the 'Materials Index'. This
index contains entries which cannot sensibly be placed in other
indexes and yet are useful for information retrieval. Such
entries include types of ions (e.g. Ar(+)) and compounds (e.g.
metal chelates), biological classifications (e.g. Klebsiella) and
even the venues (Berlin, Dallas) of mass spectrometry meetings.
As a result, this General Index can occasionally be
extraordinarily useful as a means of locating a specific piece of
information.
The program 'Index' can be used to search through the accumulated
general indexes and operates in the same way as the subject
search described above. An example of the use of this program is
given in Table 4. In the first case, the user was able very
rapidly to locate the one paper dealing with carbon dioxide that
was presented at the 1975 NATO Advanced Study Institute held in
Biarritz.
This program, like the author search, uses the first four letters
of the index term supplied by the user. This rarely produces more
than a few possible answers; problems such as that arising from
names like 'Smith' apparently do not occur in this case. The
General Index is not static and can grow more or less rapidly as
more papers are abstracted. This raises the possibility that
newly introduced terms cannot be used immediately for searching.
Such terms will be inaccessible until the next annual update of
the searchable data base.
Dissemination of the system
The Mass Spectrometry Search System (MSBULL) is available as an
extension of the Mass Spectral Search System. This system is
maintained by MSDC on the ADP-Cyphernetics international computer
network. In addition to the searching programs, the system, like
MSSS, contains user assistance files that can be accessed by the
user who needs help with the conversational style of the system.
The user can also list the currently used subject codewords or
the MSDC compound classification codes.
We acknowledge the generous cooperation received from Dr. A.
McCormick and Mr. D. C. Maxwell of the Mass Spectrometry Data
Centre, Aldermaston, England, and from Mr. A. C. Nicholas of the
Department of Industry of the U.K. Government.
REFERENCES
1 Mass Spectrometry Bulletin, Mass Spectrometry Data Centre,
AWRE, Aldermaston, Berks., England.
2 S. R. Heller, H. M. Fales and G. W. A. Milne, Org. Mass
Spectrom., 7 (1973) 107; S. R. Heller, D. A. Koniver, H. M. Fales
and G. W. A. Milne, Anal. Chem., 46 (1974) 947; S. R. Heller, R.
J. Feldmann, H. M. Fales, and G. W. A. Milne, J. Chem. Doc., 13
(1973) 130; G. W. A. Milne and S. R. Heller, J. Chem. Inf. Comp.
Sci., (1976) in press.
3 H. S. Hertz, D. A. Evans and K. Biemann, Org. Mass Spectrom., 4
(1970) 453.
4 S. R. Heller, G. W. A. Milne and R. J. Feldmann, Science,
(1977) in press.
5 S. R. Heller, Anal. Chem., 44 (1972) 1951.