The State of Scientific Chemical Information Activities
Stephen R. Heller
Agricultural Research Service
US Department of Agriculture
Beltsville, MD 20705 USA
With due apology to Dickens, these are the best of
times and these are the worst of times. It has been said
that we are leaving the industrial revolution age and coming
upon a new era - the information revolution. For chemists,
however, the revolution has all but begun without us. The
local, national, and international chemical communities need
and want far more organized and automated chemical,
physical, biochemical and biological data than is currently
available. And now with further apologies to John Paul
Jones, I say, we have just begun to fight.
A computer-based scientific chemical information
system requires five main components. They are:
1. Numeric Data
2. Computer Software
3. Computer Hardware
4. Computer Networks
5. System/Organization
DATA
Data, the first of the five areas, focuses on
numeric data, not bibliographic information. However,
before I begin the discussion of numeric databases I just
want to say a few words about how I think scientists will
access bibliographic databases or literature references in
the near future.
It seems quite clear that bibliographic information
will, sooner or later, be available on optical (read-only
and/or read and write) laser disks, such as the one recently
announced for sale by the Digital Equipment Corporation
(DEC) for their VAX line computer systems. The result of
such optical laser disk systems will be that every library
in the world, and probably many research laboratories, will
have access to all the bibliographic databases they wish.
The current compact read-only optical disk which DEC sells
holds over 200,000 single-spaced typed pages.
These 120 mm (4.7 inch) disks hold the equivalent of 1600
floppy disks, or the amount of information transmitted for
46 days at 1200 baud.
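
As a rough check on these equivalences, the arithmetic can be
sketched as follows. The figure of 10 bits per character on a
1200 baud line and the 360 KB capacity of a floppy disk are
assumptions made for illustration; neither is stated above.

```python
# Rough capacity check for the optical disk figures quoted above.
# Assumed, not from the text: 10 bits per character on the line
# (8 data bits plus start and stop bits) and 360 KB per floppy disk.
BAUD = 1200
BITS_PER_CHAR = 10
DAYS = 46
FLOPPIES = 1600
FLOPPY_BYTES = 360 * 1024
PAGES = 200_000

chars_per_second = BAUD // BITS_PER_CHAR
line_bytes = chars_per_second * DAYS * 24 * 3600
floppy_bytes = FLOPPIES * FLOPPY_BYTES

for label, total in (("46 days at 1200 baud", line_bytes),
                     ("1600 floppy disks", floppy_bytes)):
    print(f"{label}: {total / 1e6:.0f} MB, "
          f"{total / PAGES:.0f} characters per page")
```

Under these assumptions both comparisons come out at roughly
half a gigabyte per disk, or between two and three thousand
characters per typed page.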
With a few such disks it would be possible to have all of
the 100-200 most cited journals on hand,
so the researcher could easily look up a reference article
right at the same time as a literature search is taking
place. Only a critical mass of inexpensive disk readers
and a royalty agreement amongst the major publishers are
needed. With a cost of about $10,000 to produce a master
disk and 100 copies of the master disk, the price for this
storage medium is becoming practical.
Given this scenario, text and full text systems, such
as the many American Chemical Society (ACS) journals now
available online, will be unnecessary except to have the
keywords for searching. The fact that these full text
journals, and other full text publications, are not in heavy
demand and use is, I believe, a sign that the scenario
outlined above is likely to prove true.
There are many areas of numeric data which are of
interest and value to the chemical community. These areas
include:
1. Spectroscopy
2. Toxicology and Environment
3. Physical/Chemical
4. Biochemical/Biological
SPECTROSCOPY
The first of these areas, spectroscopy, probably has
the most, and the oldest, numeric databases. The main
databases in this area include:
1. Mass Spectrometry
2. Infrared
3. Nuclear Magnetic Resonance
4. Ultraviolet and Visible Spectra
5. Raman Spectra
6. Mossbauer Spectra
Mass Spectral Data
Mass spectral databases are amongst the oldest numeric
databases in chemistry. The reason is quite simple: mass
spectra are easy to encode into computer readable form. Even
in the early 1960's, with punch card and tape based
systems, mass spectra were being used. Mass spectral data were
the first to be used in the NIH/EPA Chemical Information
System (CIS), which was developed by the US Government
between 1970 and 1984. Even today, while the CIS is no
longer being developed and maintained by the US Government,
the mass spectral database (MSSS) has survived and is a
joint National Bureau of Standards, Office of Standard
Reference Data (NBS OSRD) and EPA Office of Research and
Development project, although these organizations do not
make the database available online. The history of the
development and evolution of this database is useful to
discuss, since it is a large, widely used database.
The first database of the CIS, MSSS, was a collection of
8124 mass spectra provided to the NIH by Professor Biemann
of MIT. With this database in hand, a collection of
programs to search and manipulate the data and information
in this database was developed over a period of years. It
was the learning experience from this first database which
began to drive all future activities of the NIH/EPA CIS
project. For example, it was clear from almost the start
that the database was not large enough to be as useful a
tool as desired. Furthermore, multiple copies of spectra,
coupled with different names for the same materials, led to
confusion and probably to a reinforcement of the notion of
"garbage in - garbage out". The lack of a measure of the
quality of the mass spectral data was also an issue.
Lastly, the need for full time staff to obtain, enter, edit
and prepare a final product of the mass spectral database
was the most critical problem, as solving the technical
matters is usually simpler. Thus a collaboration with the
Mass Spectrometry Data Centre (MSDC) in Aldermaston (and
later Nottingham), England was initiated, through the NBS
OSRD. For a number of years the MSDC disseminated the mass
spectral database and programs to search the file, which was
given the name the Mass Spectral Search System (MSSS).
However, changes in UK government policy, coupled with
the decision of EPA management to assume a greater role in
the development and dissemination of mass spectral data,
altered this arrangement. The EPA decision came at the time
its laboratories and Office of Research and Development
(ORD) had just started their efforts to use mass
spectrometry as the main tool for pollutant and toxic
chemical identification.
From 1974 to 1980 EPA, NIH and NBS funded and
collaborated in the development of a quality index (QI) for
the mass spectral database, which was comprised of some 10
data quality indicators (DQI's) which were originally
developed by McLafferty and co-workers. Unfortunately there
have been some minor changes in the DQI's between the
McLafferty database and the NBS database, and thus the QI's
in the two databases today differ slightly. The quality
control and quality index of mass spectra will be discussed
further later on.
From 1971 to early 1985, the mass spectrometry database
grew from 8124 mass spectra to 42,229. In addition,
thousands of duplicates and spectra of labelled chemicals
were archived to a separate database, which now consists of
over 70,000 spectra. Coupled with this activity, Chemical
Abstracts Service (CAS) Registry Numbers were obtained for
all chemicals in the database, providing unique identifiers
for each spectrum. Along with the CAS Registry Numbers
came the CAS standard index names, the molecular formula
of each chemical in standard Hill notation, and a quality
index for each spectrum. When
a new spectrum is received, even if it is a duplicate, it
replaces a spectrum in the database if its quality index is
higher than that of the existing spectrum in the database.
Thus a "living" database is created, which contains both
new and replacement spectra at every yearly update.
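
The replacement rule described above can be sketched roughly
as follows. This is only an illustration, not the actual NBS
or EPA software: the Spectrum record, the submit function and
the quality index values are hypothetical, and sending the
displaced spectrum to the archive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Spectrum:
    cas_rn: str           # CAS Registry Number, the unique key
    quality_index: float  # higher means better (hypothetical scale)
    source: str


def submit(active, archive, new):
    """Keep the best spectrum per CAS Registry Number in `active`."""
    current = active.get(new.cas_rn)
    if current is None:
        active[new.cas_rn] = new      # first spectrum for this chemical
    elif new.quality_index > current.quality_index:
        active[new.cas_rn] = new      # better duplicate replaces the old one
        archive.append(current)       # assumption: old spectrum is archived
    else:
        archive.append(new)           # poorer duplicate goes to the archive


active, archive = {}, []
# 71-43-2 is the CAS Registry Number of benzene; the rest is made up.
submit(active, archive, Spectrum("71-43-2", 0.60, "lab A"))
submit(active, archive, Spectrum("71-43-2", 0.90, "lab B"))
submit(active, archive, Spectrum("71-43-2", 0.50, "lab C"))
print(active["71-43-2"].source, len(archive))
```

Keying the active file on the CAS Registry Number is what
makes the duplicate check a simple lookup.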
The mass spectral database is now being made available
through the NBS OSRD as well as having been combined with
the McLafferty mass spectral database, available from John
Wiley & Sons. In addition, the NBS OSRD, through the US
Government Printing Office, made available a hard copy
version of the database. These books, called the EPA/NIH
Mass Spectral Data Base, were first published as a four
volume set, with an index. Over time another two volumes
were published, and the entire set now consists of six
volumes and an index which covers all six volumes.
The mass spectrometry database is now under the joint
responsibility of the NBS OSRD and the EPA Office of
Research and Development, Environmental Monitoring and
Support Laboratory (EMSL) in Cincinnati, Ohio. Data are
still being received from the MSDC, along with contributions
from the scientific community and new mass spectral data
from an EPA contract which is part of the ORD EMSL
activities.
It is worth noting here that the EPA EMSL effort, which
runs about 1000 spectra per year, costs an average of $243
per spectrum. This includes the cost to acquire the sample
($61), the cost to run the sample ($52), and lab overhead
($130). These are not insignificant costs, and they continue
to increase, not only due to inflation, but also due to the
shrinking number of "easy" to obtain samples.
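
The cost figures can be checked with a few lines of
arithmetic. The annual total below simply assumes exactly 1000
spectra in a year, which is a rounding of the "about 1000"
quoted above.

```python
# Per-spectrum cost components quoted in the text, in US dollars.
COSTS = {"acquire sample": 61, "run sample": 52, "lab overhead": 130}
SPECTRA_PER_YEAR = 1000  # assumption: "about 1000" taken as exactly 1000

per_spectrum = sum(COSTS.values())
annual = per_spectrum * SPECTRA_PER_YEAR
overhead_share = COSTS["lab overhead"] / per_spectrum

print(f"per spectrum: ${per_spectrum}")
print(f"per year:     ${annual:,}")
print(f"overhead:     {overhead_share:.0%} of the total")
```

The components do add up to the quoted $243, and overhead
accounts for slightly more than half of it.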
In addition to the electron impact (EI) mass spectral
database, there have been some efforts on other types of
mass spectral data, both Chemical Ionization (CI) and Fast
Atom Bombardment (FAB) spectra. At present the CI database
consists of fewer than 2000 spectra, and the FAB activities
at MSDC are still in their formative stages.
In addition to the value of electron impact mass
spectra, the high level of interest in chemical ionization
mass spectrometry has led to a need for a reliable file of
gas phase proton affinities. The task of gathering and
evaluating all published gas phase proton affinities was
completed by Rosenstock and co-workers at NBS. This file,
which has about 400 critically evaluated gas phase proton
affinities drawn from the open literature, can be searched
on the basis of compound type or the proton affinity value.
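
The two kinds of search mentioned above might look something
like the sketch below. The record layout, the function names
and every entry are hypothetical placeholders invented for
illustration; none of the values are real proton affinities
and this is not the NBS file format.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    compound_type: str
    proton_affinity: float  # placeholder numbers, not measured values


# Hypothetical stand-in for the file; the real file has about 400 entries.
PA_FILE = [
    Entry("compound A", "amine", 900.0),
    Entry("compound B", "amine", 850.0),
    Entry("compound C", "ether", 800.0),
    Entry("compound D", "alkane", 550.0),
]


def search_by_type(entries, compound_type):
    """Return all entries of the given compound type."""
    return [e for e in entries if e.compound_type == compound_type]


def search_by_value(entries, low, high):
    """Return entries whose proton affinity lies in [low, high]."""
    return [e for e in entries if low <= e.proton_affinity <= high]


print([e.name for e in search_by_type(PA_FILE, "amine")])
print([e.name for e in search_by_value(PA_FILE, 780.0, 860.0)])
```

A range search on the value is the natural form here, since a
measured affinity will rarely match a tabulated one exactly.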
Infrared Spectral Data
Infrared (IR) spectral data are also amongst the oldest
of numeric databases for chemists. While IR databases were
developed in the 1950's, the original database, compiled
through the American Society for Testing and Materials
(ASTM), was quite limited due to the complex nature of IR
spectral curves and the considerable lack of data storage
methods and devices three decades ago. The result of these
limitations, which for historical reasons have been carried
forward to this day, is that the database is 80 column card
oriented, and contains not the spectrum but rather
significant peaks, coded in the older micron wavelength
units, rather than complete
spectral data coded in wavenumbers. While there are
considerable limits to this ASTM database, it does contain
some 150,000 spectra (including an unknown number of
duplicates), which makes it by far the largest spectral
database available.
The serious flaws in the