Numerical Data Acquisition, Dissemination, and Retrieval Systems
Stephen R. Heller
USDA, ARS, Beltsville, MD 20705-2350 USA
Abstract
Activity in scientific data acquisition, dissemination, and
retrieval continues to increase. As both information specialists
and working scientists realize the need for and value of good
scientific data, more work in this area is undertaken. This
paper will describe some problems in this field, and will propose
the creation of a knowledge base for spectroscopy as a means of
helping to solve the problem of chemical structure elucidation.
INTRODUCTION
The topic of the 8th International Conference on Computers in
Chemical Research and Education (ICCCRE) Key-note and Focused
Discussion session on numerical data involved three areas. They
are data acquisition, data dissemination, and data retrieval.
Numerical data suffer from shortcomings in all three, but mostly
in data acquisition. By acquisition I mean obtaining the data,
either from instruments or calculations, evaluating the data,
creating a complete description of the data, and putting the
resulting complete record into a readily accessible
computer-readable form. Furthermore, I also mean the intelligent
and logical organization of these data into databases that are
more than just good and useful databases, namely knowledge bases.
Knowledge bases, as used here for spectroscopy, are correlations
of chemical structure(s) associated with the spectral data.
DATA ACQUISITION
In the data acquisition process there are four areas for which one needs information before the data record is complete. These areas are:
1. Data Parameters
2. Actual Data
3. Experimental Conditions
4. Sample Description (e.g., purity)
The data parameters would include the initial and final values
(such as in a spectrum), the minimum and maximum values (such as
0 to 100%, with additional decimal places if appropriate), the
units for the data, the number of data points, the chemical name
(preferably using standard nomenclature), the Chemical Abstracts
Service Registry Number and/or other identification numbers, and
the data format. The actual data are simply the numbers (with
the proper significant digits) from the instrument or experiment.
The experimental conditions, while obvious and clear, are often
left out or reported incompletely. The sample description would
also include the source, purity, method(s) of purity and identity
verification, and so forth for the chemical. Ideally all of these
items would be included in a database. Sadly, this is not the
case, and a truly complete record is rarely found. This is partly
because of a lack of knowledge and understanding of database
activity: many databases have been put together by a researcher
in the field who has little or no interest in data per se, but
only in the specific data of his or her own field of interest.
There is also a lack of resources, which is a more difficult
problem to solve.
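The four components of a complete data record described above can be sketched as a simple data structure. This is an illustrative sketch only; the field names are my own, not taken from any standard format:

```python
from dataclasses import dataclass

# Illustrative sketch of a "complete" spectral data record; all field
# names here are hypothetical, not from any standard.
@dataclass
class DataParameters:
    first_x: float          # initial abscissa value
    last_x: float           # final abscissa value
    min_y: float            # e.g. 0 (% transmittance)
    max_y: float            # e.g. 100
    units_x: str
    units_y: str
    n_points: int
    chemical_name: str      # preferably standard nomenclature
    cas_registry_number: str
    data_format: str

@dataclass
class SpectralRecord:
    parameters: DataParameters      # 1. data parameters
    data: list                      # 2. actual data points
    experimental_conditions: dict   # 3. instrument settings, etc.
    sample_description: dict        # 4. source, purity, verification

    def is_complete(self) -> bool:
        """A record is complete only when all four areas are filled in."""
        return bool(self.data and self.experimental_conditions
                    and self.sample_description)
```

A record missing, say, its experimental conditions would fail the completeness check, which is exactly the shortcoming described above.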
In almost every computer system, be it a dedicated data system on
an instrument, an in-house computer system, or a large online
system, there are two main problems. The biggest is the lack of
sufficient data parameters for a large number of chemicals. The
quality of the data is a close second, but in the opinion of the
author the main problem is the lack of data. This lack of
sufficient data has been the case for many years, with little
reason to believe it will change.
One possible solution to this would be to make the maximum use of
what data there are by taking these data and information and
creating a knowledge base of spectral correlations. This matter
will be addressed in more detail later on in this paper.
Before continuing on about data quality and/or data completeness,
it is useful to say a few words, by using an example, about why
it is really important. At a recent database and online meeting
on CD-ROM (Compact Disk -Read Only Memory) technology there were
considerable discussions on what is available on CD-ROM. There
is a great deal of interest in putting all types of databases on
CD-ROM's for ease of dissemination, ability to handle large
volumes of data, and so forth. When discussing the cost of a
CD-ROM, there are three components: the pre-mastering cost
(preparing and cleaning the data for mastering), the mastering
cost (preparing the master disk from which all copies are made),
and the copying cost. The point is best made by taking these
steps in reverse order. Once the master disk is prepared, the
actual cost to copy a disk is a few US dollars (perhaps $3-$5).
The mastering cost, which requires a machine costing perhaps
$500,000 to $1,000,000, is about $1,000-$2,000. The pre-mastering
costs were estimated at $0 to $100,000. The $0 cost is an example
yet to be achieved: if you have a "clean" and complete database,
then there is no problem, and hence lower costs. However, many
CD-ROM projects start by first having to create the database.
Data creation, or data acquisition, is the expensive,
time-consuming, and labor-intensive part of a database project.
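The cost structure just described can be made concrete with a little arithmetic. Using the mid-range figures quoted above (copying about $4, mastering about $1,500, pre-mastering from $0 to $100,000), the per-copy cost for a given run size is easily computed; the dollar figures below are the text's illustrative estimates, not vendor quotes:

```python
def cost_per_copy(n_copies, premastering, mastering=1500.0, copying=4.0):
    """Per-disk cost: fixed costs amortized over the run, plus copying.

    The default dollar figures are mid-range estimates from the text;
    they are illustrative only.
    """
    return (premastering + mastering) / n_copies + copying

# A "clean" database (pre-mastering ~ $0) versus one built from scratch,
# for a run of 1,000 disks:
clean = cost_per_copy(1000, premastering=0.0)               # 5.5 dollars/copy
from_scratch = cost_per_copy(1000, premastering=100_000.0)  # 105.5 dollars/copy
```

The comparison shows why a clean, complete database matters: for a modest run, pre-mastering dominates every other cost combined.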
Data Quality
The situation with data quality is improving, albeit slowly. As
time goes on the community is realizing that good data
acquisition starts from the time you begin to plan the database
collection activity. Work on the mass spectral database Quality
Index (QI) (1) has continued under the leadership of Bill Budde
of the US EPA. Budde has analyzed the meaning of the elements
which make up the QI, and has made some modifications based on
his experience (2,3). This is as it should be in science. When
new information is obtained or old data are reanalyzed, changes
often occur. This is because the QI is not an absolute measure
of quality, but rather a relative measure, and most importantly,
a statement that the quality is defined. In the area of NMR,
Bremser (4), after discussions at the previous ICCCRE in 1985 (5),
has published a first version of a quality indicator for 13-C NMR
spectra. This quality indicator is composed of five components,
but does not yet include any component concerned with sample
quality, preparation, and source, or with instrument conditions
and other parameters which would affect the quality of a
spectrum.
In the area of infrared (IR) spectroscopy, a first step is now
underway under the direction of Wilkins and Griffiths (6) to
create a quality index for FT-IR (Fourier Transform IR) spectra.
(Non FT spectra will be accepted, but given an arbitrary lower
rating, assuming all other parameter evaluations and factors are
the same.) These researchers have decided not to accept any data
that were not collected directly in computer-readable form, as
such spectra are much more subject to clerical data errors.
They have proposed a three-digit quality index. Each digit lies
within the range of 0 (low) to 9 (high). The first digit
represents a measure of the sample authenticity and purity. The
second is a measure of the sample preparation. The last of the
three digits refers to the instrumental operating conditions.
The authors believe this will allow the user of this proposed
database to discriminate easily between, for example, badly
measured spectra of pure samples, and well measured spectra of
relatively impure samples. The draft questionnaire used to
determine the values of the three digits is seven pages long. By
the time these scientists and their advisory committee make the
first draft available to the public, it will probably be longer
and more detailed. Of course, a more detailed list of requested
information may well decrease the cooperation and response the
authors receive. This desire or need to have detailed information
on each spectrum is another reason why a knowledge base, which
contains correlations based on many observations but not the
details necessary for a library spectrum, has considerable
attraction and value. Still, this effort should be encouraging to
the IR community in particular, and to the scientific community
in general.
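As a sketch of how such a three-digit index might be handled in software, consider the following. The encoding is my own illustration, not the Wilkins and Griffiths specification; in particular, the way the non-FT penalty is applied here is a pure assumption:

```python
def encode_qi(sample_authenticity: int, sample_preparation: int,
              instrument_conditions: int, is_ft: bool = True) -> str:
    """Pack three 0 (low) to 9 (high) component scores into a
    three-digit quality index string.

    The proposal described in the text gives non-FT spectra an
    arbitrarily lower rating; here (as an assumption for
    illustration) the instrument-conditions digit is simply capped.
    """
    for d in (sample_authenticity, sample_preparation, instrument_conditions):
        if not 0 <= d <= 9:
            raise ValueError("each QI component must lie in 0..9")
    if not is_ft:
        instrument_conditions = min(instrument_conditions, 5)  # arbitrary cap
    return f"{sample_authenticity}{sample_preparation}{instrument_conditions}"

def decode_qi(qi: str) -> tuple:
    """Split a three-digit QI back into its three components."""
    return tuple(int(c) for c in qi)
```

With such an index, a user could easily distinguish, say, "190" (authentic sample, badly measured) from "919" (impure sample, well measured), which is exactly the discrimination the authors intend.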
I believe that as each of these data evaluation projects is
started and maintained, it will lead both to improvements in
defining the quality of scientific data and to additional
pressure on other scientific groups to develop quality index
parameters for their data.
In addition to having better databases I also believe these data
quality evaluations will allow for the development of knowledge
bases. This is because as evaluators begin to examine
collections of data to determine how the data are best evaluated,
they will begin to think in terms of correlations from the
existing data. These collections will, in time, lead to the
creation of what are called knowledge bases. As stated above, in
simple terms a knowledge base can be defined as a collection of
correlations of data and the partial or full chemical
structure(s) associated with the data. Appendix 1 describes a
draft proposal I have prepared to create a knowledge base of
spectral data, which will consist of spectral data and the
structure, sub-structure, and/or structures associated with those
data. This knowledge base will be a valuable resource for all
spectroscopists. The knowledge base will also identify "data
gaps": areas for which there are little or no data on which to
base any correlation, or areas in which one is not able to make
an acceptable correlation. (Having only one example of a data
value makes it hard to have a clearly valid correlation, and even
more difficult to have an acceptable standard deviation.)
Lastly, the knowledge base will help in data evaluation by
identifying possible data outliers, which would make the
scientist examine the data and experimental conditions more
carefully and closely.
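A minimal sketch of such a knowledge base entry, and of how it could flag a possible outlier for closer examination, follows. The substructures, shift values, and the three-sigma tolerance rule are all illustrative assumptions, not established correlations:

```python
# Hypothetical knowledge base: correlations between a substructure and
# 13-C NMR chemical shifts (ppm) observed for it. All numbers are
# invented for illustration.
KNOWLEDGE_BASE = {
    "C=O (ketone)":   {"mean": 205.0, "std": 5.0, "n_observations": 120},
    "C=O (aldehyde)": {"mean": 195.0, "std": 4.0, "n_observations": 85},
    "aromatic C-H":   {"mean": 128.0, "std": 8.0, "n_observations": 400},
}

def is_possible_outlier(substructure: str, shift_ppm: float,
                        n_sigma: float = 3.0) -> bool:
    """Flag a value lying more than n_sigma standard deviations from
    the correlation mean, prompting a closer look at the data and the
    experimental conditions."""
    corr = KNOWLEDGE_BASE[substructure]
    if corr["n_observations"] < 2:
        # A single observation cannot support a valid correlation,
        # let alone a standard deviation (the "data gap" case).
        raise ValueError("too few observations for a valid correlation")
    return abs(shift_ppm - corr["mean"]) > n_sigma * corr["std"]
```

A reported ketone carbonyl shift of 230 ppm would be flagged against this (hypothetical) correlation, while 207 ppm would pass unremarked.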
DATA DISSEMINATION
The subject area of data dissemination is easier to deal with, since the work of obtaining and preparing the data is over at this point. There are three main media with which to disseminate data. They are:
1. Computer tapes
2. Floppy Disks
3. CD-ROM
Computer tapes offer an excellent way to disseminate large
databases. Today many labs can read standard 800/1600/6250 bpi
9-track tapes. As long as the provider of the data documents the
tape format and data elements, it should be an easy matter to read
a tape of data. It is usually up to the user to create a
searchable version of the data. However, in some cases it may be
useful or desirable to provide the user with search and retrieval
software. Generally, most search and retrieval software is
written by and for the programmer in the laboratory and for the
particular environment of that laboratory. Thus most such
software does not perform all the functions desired, or performs
them in a unique manner tailored to a particular environment.
With the enormous growth in the number of personal computers, it
is becoming more likely that databases (and in the cases of large
databases - subsets of databases) will be made available on
floppy disks. The data on these floppy disks can also easily be
put onto hard disks. With storage capabilities of 360 K to 1.2
MB for 5 1/4 inch floppy disks for IBM PC's and their
equivalents, and 720 K - 1.44 MB for the newer 3 1/2 inch floppy
disks, quite a lot of data can be made readily available to
anyone. IUPAC has recently established a new committee called
the Committee on Chemical Databases (7). This committee is
planning to disseminate IUPAC databases on tape, and floppy
disks. Since most of the IUPAC databases are small, the floppy
disk is an ideal medium for dissemination. It is also ideal
because most members of IUPAC have, or have access to, a computer
with a floppy disk drive. The National Bureau of Standards, Office of
Standard Reference Data (NBS - OSRD) is about to release a copy
of their mass spectral database, along with a search and display
software system which will include a database of about 45,000
spectra (8). All this will easily fit on a 20MB hard disk. The
search times are quite fast (on the order of a few seconds) and,
at a price of $750, this seems likely to increase the use of such
data and the general interest in numeric spectral data.
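A quick calculation shows how compactly such a collection must be stored. Since the search and display software shares the disk, the true per-spectrum figure is somewhat smaller still:

```python
# Rough average storage per spectrum for the collection described
# above: about 45,000 mass spectra (plus search software) fitting
# on a 20 MB hard disk.
disk_bytes = 20 * 1024 * 1024        # 20 MB disk (binary megabytes)
n_spectra = 45_000
bytes_per_spectrum = disk_bytes / n_spectra   # roughly 460-470 bytes each
```

A few hundred bytes per spectrum leaves room for little more than the peak list itself, underscoring why full experimental details rarely travel with library spectra.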
CD-ROM (Compact disk - Read only memory) is a dissemination
medium of great potential. While there are few CD-ROM disk
drives in computer systems today, their numbers are expected to
increase considerably in the next few years. The cost of
mastering a disk is high, but if 100 - 2000 copies can be sold,
there are great economies of scale. At present the publishing
company of John Wiley & Sons is experimenting with a mass
spectrometry database on a CD-ROM, and the number of sales in the
first few months is low. But Wiley should be given credit for
exploring the market and seeing what the community wants. The
Wiley system comes with a search program to search and display
the resulting search hits.
Before ending the discussion on data dissemination it is worth
noting an interesting and valuable project of the US Joint
Committee on Atomic and Molecular Physical Data (JCAMP) (9).
JCAMP-DX (DX stands for data exchange) is a project to establish a
common data exchange format for IR spectra from different
instruments and manufacturers. This would allow IR data to be
readily exchanged via tape, over telephone lines, or by other
means. The most promising aspect of this project is that the
exchange format appears to be broad enough to be used for other
forms of data exchange.
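The flavor of such an exchange format can be illustrated with a minimal writer for a JCAMP-DX-style labelled record. This is a simplified sketch only: the published standard defines many more labels and compressed ordinate formats, and only an uncompressed (X, Y) table is written here:

```python
def write_jcamp_like(title, xunits, yunits, points):
    """Emit a minimal JCAMP-DX-style labelled record as a string.

    Simplified sketch: the real standard defines many additional
    labels and compression schemes; this writes only an uncompressed
    table of (x, y) pairs.
    """
    lines = [
        f"##TITLE= {title}",
        "##DATA TYPE= INFRARED SPECTRUM",
        f"##XUNITS= {xunits}",
        f"##YUNITS= {yunits}",
        f"##FIRSTX= {points[0][0]}",
        f"##LASTX= {points[-1][0]}",
        f"##NPOINTS= {len(points)}",
        "##XYPOINTS= (XY..XY)",
    ]
    lines += [f"{x}, {y}" for x, y in points]
    lines.append("##END=")
    return "\n".join(lines)
```

Because the record is plain labelled text, it can travel over any medium, tape, telephone line, or network, without regard to the instrument that produced it.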
RETRIEVAL SYSTEMS
There are few online retrieval systems for scientific numerical
data (10). Most of the online systems, such as ORBIT and DIALOG,
are almost exclusively bibliographic. The NIH/EPA Chemical
Information System (CIS) (11) is a major source of online numeric
databases, as is the Numerica system from Technical Data Services
(12). The Beilstein Institute is developing its own numeric data
retrieval system for use with the Beilstein Handbook database,
which will become available online in 1988 (13). There are
other systems, but these are all much smaller than the CIS or
Numerica systems at present, and generally contain only one type
of data, such as crystallographic data, powder diffraction data,
or chemical Material Safety Data Sheets (MSDS). In the future it
is expected that the Chemical Abstracts Service (CAS) Science and
Technology Network (STN) will expand the capabilities of its
Messenger software so that it can search scientific data (such as
the Beilstein factual database) in addition to the bibliographic
text data it is now able to search.
CONCLUSION
The data acquisition process continues to improve, and data
quality and data evaluation are getting more time and attention
from scientists and database organizers. The development of
quality index values for spectral databases now covers an
increasing number of spectral data areas and is becoming more
routinely accepted by scientists. It would be desirable for the
process to move even more quickly, and hopefully more discussion
and more use of these types of data will speed it up.
The need to proceed to the next step of creating knowledge bases
from spectral databases is an idea whose time has come. It is
expected that such a project will be started in the near future
and will result in both more valuable information and knowledge
for the spectroscopists and a better understanding of the data
contained in a spectrum.
Data dissemination is being handled in an acceptable manner, with
CD-ROM technology having a promising future. The searching and
retrieval of numerical data are generally so far ahead of the
databases that this is an area which requires less attention. No
matter how good or efficient the computer software may be, it
cannot overcome poor, missing, or incomplete data.
Appendix 1
A Proposal for a Knowledge Base for Spectral Interpretation
Systems
INTRODUCTION
There have been many efforts over the past three decades to use
computers for spectral interpretation and identification (14).
These efforts have centered around two areas, namely library
searching and expert systems.
The first area is library search identification using spectral
databases. Library search systems are now commonly available as
part of mass spectrometer, infrared spectrometer, and other
spectrometer data systems. They are also available in online
timesharing systems. These systems, while varying in quality and
usefulness, all suffer from the lack of a sufficiently large,
diverse (in terms of different types of chemical structures), and
high-quality collection of data, which limits their use.
The second area is Expert Systems (ES), which are part of the
field of artificial intelligence (AI). The DENDRAL (15) program
was the first of these systems. As Zupan has pointed out, its
actual success has been somewhat less than claimed (16). Other
systems, such as CASE (17) and CHEMICS (18) also have had limited
success and certainly are not used routinely in labs. The major
problem with these systems has been that no one laboratory has
the resources to create a large enough knowledge base to interest
a broad user community in testing, evaluating, and using the
system. While AI and ES are in vogue, and considerable interest,
technical and commercial, has been shown in ES in the past few
years, most of this has been form, not content. Existing ES
remain a primitive technology primarily because they lack large,
high-quality knowledge bases.
In his Presidential Address at the 1985 American Association of
Artificial Intelligence meeting (19) Woody Bledsoe stated: "I
believe that it is time to build large, very large knowledge
bases. Such a knowledge base should contain common sense
knowledge and encyclopedic and expert knowledge and be structured
to handle learning and performance requirements ... It is
believed that such a large-structured knowledge base would not
only allow the sharing of knowledge by numerous systems, but, if
structured correctly, could provide much more robustness and
functionality than is possible from a number of distinct smaller
knowledge bases."
PROPOSAL
It now appears reasonable to conclude that the most critical need for an expert system for spectral structure elucidation is a large and diverse knowledge base. Such a knowledge base, encompassing the areas of mass spectrometry, IR, NIR, NMR (13-C and H), UV, VIS, and possibly others, such as Raman, would be of
immense and critical value to the scientific community. There
are no current projects which this proposal would duplicate.
There are a number of spectral database projects, in the public
and private sector (20), but these are "just" databases.
Instrument companies have been experimenting with the use of ES
for instrument tune-up and control (21), and one has even
incorporated an existing ES into their data system (22). There
is an excellent collection of tables of structure-spectral
correlations which has been published in book form (23), but it
is not yet computerized. Hopefully this information could be used
as part of the knowledge base to be developed.
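One way such structure-spectral correlation rules might be represented for exchange among collaborators is sketched below. The field names and the rule shape are my own invention for illustration; no agreed exchange format for such rules yet exists:

```python
# Hypothetical representation of one knowledge-base rule relating a
# substructure to an expected spectral feature. The fields are
# illustrative, not from any agreed format.
def make_rule(technique, substructure, feature_range, units, contributor):
    return {
        "technique": technique,          # e.g. "IR", "13-C NMR", "MS"
        "substructure": substructure,    # partial structure the rule covers
        "feature_range": feature_range,  # (low, high) expected value
        "units": units,
        "contributor": contributor,      # expert who supplied the rule
    }

def matches(rule, observed_value):
    """Does an observed spectral feature fall inside the rule's range?"""
    low, high = rule["feature_range"]
    return low <= observed_value <= high
```

Rules of this kind, collected from many experts, could be discussed, refined, and redistributed electronically, which is precisely the collaborative mechanism proposed below.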
It is proposed that an international collaborative project be
established to create and maintain a knowledge base of spectral-structure relations. Such a knowledge base will contain the
accumulated wisdom, knowledge, and experiences of experts in all
fields of spectroscopy. In effect we will be creating a Ph.D. in
spectroscopy. Like getting a Ph.D., this will take time - a
number of years. However, besides creating a valuable scientific
resource, this effort will identify knowledge gaps; that is,
areas for which there are no known or agreed-upon
structure-spectral data correlations. These gaps can become areas
for new scientific research and investigation in the particular
area(s) of spectroscopy identified.
It is suggested that the responsibility for the knowledge base
maintenance and distribution reside with the ARS Model and
Database Coordination Laboratory (MDCL) in Beltsville, Maryland,
USA. MDCL has available a VAX/MicroVAX system which is networked
via Telenet, BITnet, and ARPAnet, allowing for easy access to send
in knowledge rules, to discuss the accuracy and validity of rules
via the electronic mail and electronic conferencing available on
the system, and to allow for dissemination of the knowledge base
via the network or on magnetic tape.

It would be expected that all collaborating members would be
provided with a copy of the current knowledge base as "payment"
for their contributions. Details of how others would obtain a
copy will need to be worked out. One likely possibility would be
to have IUPAC act as the organization for the dissemination of
this knowledge base. With the recent establishment of an IUPAC
Committee on Chemical Databases (7), there is both a potential
source of seed funding for the project and a well known and
respected scientific organization to lead the project. CODATA and
JCAMP, two other organizations involved with scientific data and
spectroscopy, cou