The State of Scientific Chemical Information Activities

Stephen R. Heller

Agricultural Research Service

US Department of Agriculture

Beltsville, MD 20705 USA



With due apology to Dickens, these are the best of

times and these are the worst of times. It has been said

that we are leaving the industrial revolution age and coming

upon a new era - the information revolution. For chemists,

however, the revolution has all but begun without us. The

local, national, and international chemical communities need

and want far more organized and automated chemical,

physical, biochemical and biological data than is currently

available. And now with further apologies to John Paul

Jones, I say, we have just begun to fight.

A computer-based scientific chemical information

system requires five main components. They are:

1. Numeric Data

2. Computer Software

3. Computer Hardware

4. Computer Networks

5. System/Organization



DATA

Data, the first of the five areas, centers on

numeric data, not bibliographic information. However,

before I begin the discussion of numeric databases I just

want to say a few words about how I think scientists will

access bibliographic databases or literature references in

the near future.

It seems quite clear that bibliographic information

will, sooner or later, be available on optical (read-only

and/or read and write) laser disks, such as the one recently

announced for sale by the Digital Equipment Corporation

(DEC) for their VAX line of computer systems. The result of

such optical laser disk systems will be that every library

in the world, and probably many research laboratories, will

have access to all the bibliographic databases they wish.

The current compact read-only optical disk which DEC sells

will hold over 200,000 single spaced typed pages on a disk.

These 120 mm (4.7 inch) disks hold the equivalent of 1600

floppy disks, or the amount of information transmitted for

46 days at 1200 baud.
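
As a rough check that the three capacity figures above agree with one another, the following sketch converts the 46 days of transmission into characters. It assumes, and this is my assumption rather than anything stated by DEC, that a 1200 baud line carries 1200 bits per second and that each character takes 8 bits with no start/stop or protocol overhead.

```python
# Rough consistency check of the disk-capacity figures quoted above.
# Assumptions (not from the text): 1 baud == 1 bit per second, and
# each character takes 8 bits with no framing or protocol overhead.

SECONDS_PER_DAY = 24 * 60 * 60
BITS_PER_SECOND = 1200      # assumed line speed in bits per second
BITS_PER_CHARACTER = 8      # assumed: no start/stop bits


def characters_transmitted(days: float) -> int:
    """Characters sent over the line in the given number of days."""
    chars_per_second = BITS_PER_SECOND // BITS_PER_CHARACTER
    return int(days * SECONDS_PER_DAY * chars_per_second)


disk_characters = characters_transmitted(46)
per_floppy = disk_characters / 1600      # 1600 floppy disks per optical disk
per_page = disk_characters / 200_000     # 200,000 typed pages per optical disk

print(f"disk capacity : {disk_characters:,} characters")
print(f"per floppy    : {per_floppy:,.0f} characters")
print(f"per page      : {per_page:,.0f} characters")
```

Under these assumptions the disk holds roughly 600 million characters, or about 370,000 per floppy disk and about 3,000 per typed page, so the three figures are mutually consistent.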

With a few disks it would be possible to have all of

the 100-200 most cited journals available in this form,

so the researcher could easily look up a reference article

right at the same time as a literature search is taking

place. Only a critical mass of inexpensive disk readers

and a royalty agreement amongst the major publishers are

needed. With a cost of about $10,000 to produce a master

disk and 100 copies of the master disk, the price for this

storage medium is becoming practical.

Given this scenario, text and full text systems, such

as the many American Chemical Society (ACS) journals now

available online, will be unnecessary except to have the

keywords for searching. The fact that these full text

journals, and other full text publications, are not in heavy

demand and use is, I believe, a sign that the scenario

outlined above is likely to prove true.

There are many areas of numeric data which are of

interest and value to the chemical community. These areas

include:

1. Spectroscopy

2. Toxicology and Environment

3. Physical/Chemical

4. Biochemical/Biological

SPECTROSCOPY

The first of these areas, spectroscopy, probably has

the most, and the oldest, numeric databases. The main

databases in this area include:

1. Mass Spectrometry

2. Infrared

3. Nuclear Magnetic Resonance

4. Ultraviolet and Visible Spectra

5. Raman Spectra

6. Mossbauer Spectra

Mass Spectral data

Mass spectral databases are amongst the oldest numeric

databases in chemistry. The reason is quite simple. They

are simple data to encode into computer readable form. Even

in the early 1960's, with punch cards and tape-based

systems, mass spectra were being used. Mass spectral data were

the first to be used in the NIH/EPA Chemical Information

System (CIS), which was developed by the US Government

between 1970 and 1984. Even today, while the CIS is no

longer being developed and maintained by the US Government,

the mass spectral database (MSSS) has survived and is a

joint National Bureau of Standards, Office of Standard

Reference Data (NBS OSRD) and EPA Office of Research and

Development project, although these organizations do not

make the database available online. The history of the

development and evolution of this database is useful to

discuss, since it is a large, widely used database.

The first database of the CIS, MSSS, was a collection of

8124 mass spectra provided to the NIH by Professor Biemann

of MIT. With this database in hand, a collection of

programs to search and manipulate the data and information

in this database was developed over a period of years. It

was the learning experience from this first database which

began to drive all future activities of the NIH/EPA CIS

project. For example, it was clear from almost the start

that the database was not large enough to be as useful a

tool as desired. Furthermore, the multiple copies of spectra,

coupled with different names for the same materials, led to

confusion and probably to a reinforcement of the notion of

"garbage in - garbage out". The lack of a measure of the

quality of the mass spectral data was also an issue.

Lastly, the need for full time staff to obtain, enter, edit

and prepare a final product of the mass spectral database

was the most critical problem, as solving technical

matters is usually simpler. Thus a collaboration with the

Mass Spectrometry Data Centre (MSDC) in Aldermaston (and

later Nottingham) England was initiated, through the NBS

OSRD. For a number of years the MSDC disseminated the mass

spectral database and programs to search the file, which was

given the name the Mass Spectral Search System (MSSS).

However, changes in UK government policy, coupled with

the decision of EPA management to assume a greater role in the

development and dissemination of mass spectral data (which

occurred at the time the EPA laboratories and the Office of

Research and Development (ORD) had just started their efforts

to use mass spectrometry as the main tool for pollutant and

toxic chemical identification), changed this arrangement.

From 1974 to 1980 EPA, NIH and NBS funded and

collaborated in the development of a quality index (QI) for

the mass spectral database, which comprised some 10

data quality indicators (DQI's) which were originally

developed by McLafferty and co-workers. Unfortunately there

have been some minor changes in the DQI's between the

McLafferty database and the NBS database, and thus the QI's

in the two databases today differ slightly. More about the

quality control and quality index of mass spectra will be

discussed later on.

From 1971 to early 1985, the mass spectrometry database

grew from 8124 mass spectra to 42,229. In addition,

thousands of duplicate spectra and spectra of labelled chemicals

were archived to a separate database, which now consists of

over 70,000 spectra. Coupled with this activity, Chemical

Abstracts Service (CAS) Registry Numbers were obtained for

all chemicals in the database, providing unique identifiers

for each spectrum. Associated with the addition of CAS

Registry Numbers were the CAS standard index names,

molecular formulas for all chemicals in the standard Hill

notation form, and a quality index for each spectrum. When

a new spectrum is received, even if it is a duplicate, it

replaces a spectrum in the database if its quality index is

higher than that of the existing spectrum in the database.

Thus a "living" database is created, which contains both

new and replaced spectra at every yearly update.
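
The replacement rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual MSSS update program: the record layout, the function name `add_spectrum`, the quality index values, and the choice to move the displaced or lower-quality spectrum to the archive are all hypothetical. The CAS Registry Number shown is that of benzene.

```python
from dataclasses import dataclass


@dataclass
class Spectrum:
    cas_number: str       # CAS Registry Number, used as the unique key
    quality_index: float  # higher means better quality (scale is hypothetical)
    source: str           # hypothetical label for the contributor


def add_spectrum(database: dict, archive: list, new: Spectrum) -> bool:
    """Add `new`; return True if it entered the main database."""
    current = database.get(new.cas_number)
    if current is None:
        # First spectrum for this chemical: always accepted.
        database[new.cas_number] = new
        return True
    if new.quality_index > current.quality_index:
        # Better duplicate: it replaces the existing spectrum, and the
        # displaced one is kept in the archive (an assumption here).
        archive.append(current)
        database[new.cas_number] = new
        return True
    # Duplicate of equal or lower quality goes straight to the archive.
    archive.append(new)
    return False


database, archive = {}, []
add_spectrum(database, archive, Spectrum("71-43-2", 0.62, "lab A"))
add_spectrum(database, archive, Spectrum("71-43-2", 0.91, "lab B"))
add_spectrum(database, archive, Spectrum("71-43-2", 0.40, "lab C"))

print(database["71-43-2"].source)  # the highest-quality spectrum is kept
print(len(archive))                # the other two are archived
```

Keying on the CAS Registry Number is what makes the duplicate check possible at all, which is why obtaining Registry Numbers for every chemical mattered as much as the quality index itself.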

The mass spectral database is now available through

the NBS OSRD and has also been combined with

the McLafferty mass spectral database, available from John

Wiley & Sons. In addition, the NBS OSRD, through the US

Government Printing Office, made available a hard copy

version of the database. These books, called the EPA/NIH

Mass Spectral Data Base, were first published as a four

volume set, with an index. Over time another two volumes

were published, and the entire set now consists of six

volumes and an index which covers all six volumes.

The mass spectrometry database is now under the joint

responsibility of the NBS OSRD and the EPA Office of

Research and Development, Environmental Monitoring and

Support Laboratory (EMSL) in Cincinnati, Ohio. Data are

still being received from the MSDC, as well as contributions

from the scientific community and from an EPA contract for

new mass spectral data which is part of the ORD EMSL

activities.

It is worth noting here that the roughly 1000 spectra run

each year by the EPA EMSL cost an average of $243

each. This figure includes the cost to acquire the sample

($61), to run the sample ($52) and lab overhead ($130).

These are not insignificant costs, and they continue to increase,

not only because of inflation but also because fewer

"easy" to obtain samples remain.
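
The arithmetic behind these figures is simple enough to check directly. The sketch below uses only the numbers quoted above; the annual total is my own extrapolation from "about 1000 spectra per year" and is not a budget figure from EPA.

```python
# Cost components per spectrum, in US dollars, as quoted in the text.
COST_PER_SPECTRUM = {
    "acquire sample": 61,
    "run sample": 52,
    "lab overhead": 130,
}
SPECTRA_PER_YEAR = 1000  # "about 1000" in the text; treated as exact here

total_per_spectrum = sum(COST_PER_SPECTRUM.values())
annual_cost = total_per_spectrum * SPECTRA_PER_YEAR
overhead_share = COST_PER_SPECTRUM["lab overhead"] / total_per_spectrum

print(f"per spectrum : ${total_per_spectrum}")
print(f"per year     : ${annual_cost:,}")
print(f"overhead     : {overhead_share:.0%} of the total")
```

The components do sum to the quoted $243, and lab overhead accounts for slightly more than half of the cost of each spectrum.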

In addition to the electron impact (EI) mass spectral database,

there have been some efforts on other types of mass

spectral data, both Chemical Ionization (CI) and Fast Atom

Bombardment (FAB) spectra. At present the CI database

consists of fewer than 2000 spectra, and the FAB activities

at MSDC are still in their formative stages.

In addition to the value of electron impact mass

spectra, the high level of interest in chemical ionization

mass spectrometry has led to a need for a reliable file of

gas phase proton affinities. The task of gathering and

evaluating all published gas phase proton affinities was

completed by Rosenstock and co-workers at NBS. This file,

which has about 400 critically evaluated gas phase proton

affinities drawn from the open literature, can be searched

on the basis of compound type or the proton affinity value.

Infrared Spectral Data

Infrared (IR) spectral data are also amongst the oldest

numeric databases for chemists. While IR databases were

developed in the 1950's, the original database, compiled

through the American Society for Testing Materials (ASTM),

was quite limited due to the complex nature of IR spectral

curves and the considerable lack of data storage methods

and devices three decades ago. The result of these limitations,

which for historical reasons have been carried forward to this

day, is that the database is 80-column card oriented and

contains not the full spectrum but only significant peaks,

coded in the older micron wavelength units rather than complete

spectral data coded in wavenumbers. While there are

considerable limits to this ASTM database, it does contain

some 150,000 spectra (including an unknown number of

duplicates), which makes it by far the largest spectral

database available.

The serious flaws in the