Stephen R. Heller
USDA, ARS
Building 005
Beltsville, MD 20705-2350 USA
The chemistry symposium at the 1991 London Online meeting
focuses on numeric databases in chemistry. The source of most of
the presentations this year itself speaks to the issue of the
title of this paper. Three of the four talks are from Germany, a
country which was the center of chemistry many years ago and
which has recently undergone a renaissance in this area. Today
it is one of the two leading countries with centers of chemical
information, particularly numeric and factual chemical data.
Numeric data in chemistry have been compiled for many years,
with the Beilstein Handbook being one of the first compilations.
Since the early 1970's, there has been a considerable amount of
activity throughout the world in the creation and maintenance of
numeric databases in chemistry. Most of the initial activities
came from the USA, particularly the NIST Office of Standard
Reference Data (1), with some efforts being undertaken in Europe
and even a few from Japan.
The title of this paper is "The Economic Future of Numeric
Databases in Chemistry", because economics is the main problem
in this area. The issue of the economics of numeric data in
chemistry is not new (2). The related issues of the technical
quality and the quantity of data, while both quite important,
play a secondary role compared to the major issue of economics.
The reason is that the cost of data quality, which involves
mostly highly trained and educated manual labor, is increasing
much faster than the usage of the information. When this is
coupled with a relative lack of quantity of data, in comparison
to bibliographic data, the result is that income derived from
numeric databases is quite low (3). As Weiske has pointed out
(4), "the production of factual data (numerical, structural,
reaction databases) is more expensive than that of bibliographic
databases. It is therefore understandable that various national
institutions and international institutions have been
participating in creating such datafiles." Thus it seems clear
that the costs relative to the income of numeric databases are
such that profit-making companies have virtually stayed away from
this area of databases.
There are two major international scientific organizations
which are heavily involved in scientific data. The first is
CODATA (5), the Committee on Data for Science and Technology.
CODATA was established in 1966 and is concerned with all types of
quantitative data resulting from experimental measurements or
observations in the physical, geographic, meteorologic,
biological, geological, and astronomical sciences, and so on.
The second is IUPAC, the International Union of Pure and
Applied Chemistry. Besides being involved in printed data
compilations, such as the Solubility Data Series publications,
IUPAC recently became involved in computer readable databases.
After a study of the matter, prompted by successive IUPAC
Presidents Schneider, Rao, and Koptyug, IUPAC established in 1985
a Committee on Chemical Databases (6). To date this group has
produced one small database on enthalpies of vaporization (7) and
a number of others are in various stages of preparation for
dissemination to the scientific community.
One of the problems with the economics of numeric data is
what may be thought of at first as a contradictory statement.
There is both too little and too much data. The problem of the
small volume of data makes it difficult to attract users to
search or use the database, while the cost of storing and
searching the large numeric databases leads to very costly
systems and access fees.
For the matter of large volumes of data two examples
are presented, the first of which is in biology. While the
nucleotide sequence database may contain over 50 million base
pairs, the human and corn genomes each have over 3 billion base
pairs! Add to that the number of bases in a few of the
other more important genomes (yeast, E. coli, mouse, cow, wheat,
rice, and so on), and one sees that there is a lot of data to be
stored. Even with the ongoing reduction in the cost of disk
storage, these costs are not trivial. In chemistry the largest
collection of scientific data is Beilstein Online. However as
one looks carefully at the database one soon discovers that,
while there are over 3.4 million compounds in the database, very
few have much data. That is, of the almost 400 data fields, only
a fraction contain data. For example, in the Beilstein database
of the 3.4 million compounds, there are 537,132 compounds with
boiling points and 3,011 with enthalpy of formation values. Thus
a user might not be pleased to find so little data for the
particular information they want. Furthermore, even if the same
user found the one (or a few) data values of interest, it is not
likely the user would go back to the database for more of the
same information for two reasons. The first is that all the
existing numeric data in the database has been found from the
first search. The second is that the database is not updated
very often (certainly not weekly or daily as bibliographic
databases are generally updated). Thus one has a situation where
large volumes of high quality data need to be stored and are
likely to be accessed relatively infrequently. And it has taken
150 years of the scientific chemical literature and years of
careful evaluation by the chemists at the Beilstein Institute to
compile even this much published data!
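The sparseness of the Beilstein data fields can be put in percentage terms. The short Python sketch below is only an illustration: it uses the two counts quoted above and treats 3.4 million as if it were an exact total, although the text says "over 3.4 million", so the percentages are approximate. The function and variable names are hypothetical.

```python
# Counts quoted in the text. The total is a rounded figure
# ("over 3.4 million"), so the percentages below are approximate
# and, if anything, slightly overstate the true coverage.
TOTAL_COMPOUNDS = 3_400_000
FIELD_COUNTS = {
    "boiling point": 537_132,
    "enthalpy of formation": 3_011,
}


def coverage_percent(count, total=TOTAL_COMPOUNDS):
    """Return the share of compounds that carry a given data field, in percent."""
    return 100.0 * count / total


for field, count in FIELD_COUNTS.items():
    print(f"{field}: {coverage_percent(count):.2f}% of compounds")
```

On these figures roughly one compound in six has a boiling point, and fewer than one in a thousand has an enthalpy of formation value.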
In the area of thermodynamics and material properties data,
the subject of a detailed presentation in this symposium, the
same situation exists. The amount of data is so small as to
preclude the possibility of massive use of the database. The
same can be said for the data contained in over 500 volumes of
the inorganic Gmelin Handbook database.
One of the more successful numeric databases is the NIH/EPA
mass spectral database, which is now being maintained and
distributed by the US Government agency NIST (8). This database
of a little over 50,000 spectra of organic compounds is of a very
good quality, but remains small for a number of reasons. One is
because so little good published mass spectral data can be
extracted from the literature for such a database. The second
reason is that, to run mass spectra from scratch, the cost to
obtain a sample and run it on a mass spectrometer exceeds $250 per
sample. Why then has it been so successful (with revenues of
almost $0.5 million per year)? One reason is that the US
Environmental Protection Agency (EPA) has demanded that this
database be used in all contract and regulatory chemical
analysis. Hence due to government regulations, this database is
widely used. Similar efforts by NIH, EPA, and NIST in the fields
of infrared (IR) and nuclear magnetic resonance (NMR)
spectroscopy have met with a much, much lower level of acceptance
and use. I believe the reason for this is the lack of a large
database coupled with the absence of a regulatory requirement.
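The figures quoted for the mass spectral database allow a rough comparison of measurement cost and revenue. The Python sketch below is a back-of-the-envelope illustration under stated assumptions, not an accounting of the actual database: it rounds "a little over 50,000 spectra" to 50,000, takes $250 per sample as the cost (the text gives this only as a lower bound), rounds "almost $0.5 million" up to $500,000, supposes every spectrum were measured from scratch, and ignores evaluation, maintenance, and distribution costs. All names are hypothetical.

```python
# Rounded figures from the text; each is an assumption of this sketch.
SPECTRA = 50_000                # "a little over 50,000 spectra"
COST_PER_SPECTRUM_USD = 250     # text: cost "exceeds $250 per sample"
ANNUAL_REVENUE_USD = 500_000    # "almost $0.5 million per year"


def remeasurement_cost(spectra, unit_cost):
    """Cost of measuring every spectrum from scratch (lower bound)."""
    return spectra * unit_cost


def years_to_recover(cost, annual_revenue):
    """Years of gross revenue needed to match a given cost."""
    return cost / annual_revenue


cost = remeasurement_cost(SPECTRA, COST_PER_SPECTRUM_USD)
years = years_to_recover(cost, ANNUAL_REVENUE_USD)
print(f"Estimated measurement cost: ${cost:,}")
print(f"Years of revenue needed to match it: {years:.0f}")
```

Under these assumptions the measurement cost alone would be $12.5 million, or about 25 years of revenue, which is consistent with the argument that even a successful numeric database is hard to justify on income alone.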
In Germany there has been an effort recently to overcome the
lack of use of spectral databases with an approach that has
considerable promise. Based on the original database and
software development work of Bremser and his colleagues at BASF,
Chemical Concepts (a company jointly funded by the German
government and industry) has started to market NMR and IR
databases along with powerful analysis programs. These programs
are designed to answer the perceived need for very large
databases for spectral identification of unknowns. Details of
this activity are given in a separate paper as part of this
meeting.
In addition to the obvious fact that these databases don't
contain as much information as desired, are there other reasons
for the low usage? I would offer a few suggestions which address
this question. First, many numeric data do not appear in the
scientific literature. Journal publishers are quite cost
conscious. Thus they, and the journal editors, want as many
papers as possible, using as little space and paper as possible.
Authors are interested in publishing to enhance and advance their
careers. Who then is there to look after the larger and longer
range question of data as a foundation for future scientific
work? Even with the recent policy changes which allow for additional
data to be submitted to journals, with such information in
supplementary materials, there has been little overall
improvement in the situation. Again, in general, no one gets
credit for publishing supplementary materials.
Second, journal publishers don't pay for what they publish.
For the most part, scientists quite willingly submit their
research results for nothing, and the journals pay essentially
nothing to scientifically process the papers being published.
The cost of the editors, advisors, and reviewers is rather
minimal. Physically publishing, marketing, and selling the
journal is where the costs are. Once the printed journal is
published there is essentially no ongoing cost for maintenance,
updating, or corrections.
Third, all publishers sell, at a single price, everything
they publish, be they scientific publishers, or publishers of a
daily newspaper. This includes materials readers don't want. In
any given journal, exactly how many articles do you read, let
alone want? The publisher is able to sell pages and pages of
articles which the reader will never look at! (I believe that is
one of the reasons why the ISI Current Contents is so popular and
valuable.) Thus the reader (or in most cases the library) buys a
product with all the accessories, bells and whistles included,
and all at a single price (even if this price be a subsidized one
as in the case of individuals or, in some cases, non-profit
organizations). When you go into a computer readable online (or
even PC based) database you are able to quickly (and cheaply)
find out if what you want is there, and if so, get it directly
and quickly. If it is not there, then you quickly leave the
system. That is hardly the sort of economic incentive to
convince companies to invest in numeric databases for their
future well being and economic survival.
In summary the outlook for commercially viable numeric
databases is poor and there is little reason to believe it will
improve in the near future. Some governments and some domestic and
international organizations realize they must subsidize such
activities. Thus the economic problems have been counterbalanced
by the longer term policies, politics, and foresight of
such groups. Overall it would seem that things are in better
shape than one would expect at this time. Hopefully over time
the recognized value of these activities will swell, usage will
increase, and it will become more apparent to the scientific
community as to the value of this information. Thus as the
subsidies begin to dwindle, the likelihood of these databases
becoming economically viable will improve.
References
1. D. R. Lide, "Critical Data for Critical Needs", Science, 212,
1343-1349 (1981).
2. S. R. Heller, "The Economics of Online Data Dissemination",
Proceedings of the 7th International CODATA Conference, pages
578-585, Ed. P. S. Glaeser, Pergamon Press (1981).
3. Harry Collier, "Strategies in the Electronic Information
Industry - A Guide for the 1990s" (1991).
Published by Infonortics Ltd., 9A High Street, Calne, Wiltshire,
SN11 0BS, England. ISBN#: 1 873699 00 X.
4. C. Weiske, "Chemical Information in a Changing Europe",
Kemia-Kemi, 18, 23-25 (1991).
5. For details about CODATA, please contact the CODATA Executive
Secretary: Mrs. Phyllis Glaeser, CODATA, 51 Blvd. de Montmorency
75016 Paris, France.
6. For details on the IUPAC Committee on Chemical Databases
(CCDB), please contact the CCDB secretary: Dr. C. Jochum,
Beilstein Institute, Varrentrappstrasse 40-42, Carl-Bosch-Haus
D-6000 Frankfurt/M. 90, Germany.
7. ENTVAPOR, a retrieval and computation system is available from
the IUPAC publisher, Blackwell Scientific Publications Ltd., PO
Box 88, Oxford, UK. The price of this IBM PC based product is
$150.00.
8. NBS Mass Spectral Database, PC Version 1.02 (Database 1-A).
Program by Dr. Stephen E. Stein, National Bureau of Standards,
Office of Standard Reference Data, Building 221, Room A-325,
Gaithersburg, MD 20899 USA. The price of this database is
$750.00.