Abstract: The growth in non-bibliographic factual databases in the field of chemistry and
related sciences is an important phenomenon to both the information scientist and the
working chemist. This presentation, and the four following talks in the chemistry session
of the International Online '86 Information Conference, will provide both an overview of
the field and some specific details on how such databases are created, evaluated,
maintained, and used by the community.
1 INTRODUCTION
As chemists and physical scientists all know, Chemical Abstracts is one of the largest and most
widely used bibliographic databases in the world today. It is quite large in size, and broad in
coverage. It contains a great deal of information, but little factual data. Furthermore it contains
abstracts which report whatever the authors choose to say. Thus these abstracts rarely contain
evaluations or quality control, but just report, in an objective manner, what is presented. (Of course,
one can argue that author abstracts, now being used more often by abstracting services, are not as
objective as those of an independent abstractor.) What is contained in such bibliographic databases is
scientific quantity, not scientific quality. These abstract entries, which often have errors, are
normally not corrected, not even in the online versions.
In contrast, non-bibliographic or factual databases contain data, numbers, and perhaps even some
knowledge. While there may be over seven million chemicals reported in the scientific literature,
there are data on only a portion of these. Consider the Beilstein database, which dates back about 150
years and does contain data on organic chemicals: it has about half the number of compounds that
are in the Chemical Abstracts Registry database. Since it is presently not possible to compare the
Beilstein and Chemical Abstracts structures, there is no way to examine the question of what
chemicals may be missing from each database. However, it is known that Chemical Abstracts does
have chemicals in its Registry System which do not exist, and hence there could be no data for these
chemicals. (For example, if an author reports energy calculations indicating that a particular
compound cannot be made, the compound, because it is discussed in the paper, is assigned a CAS
Registry Number, and thus becomes a material which has been "reported in the literature". While
this is true from an abstracting and librarian point of view, it is not clear that scientists would
consider this "reported in the literature".) Furthermore, in the case of most of these chemicals, the
data reported is often in incomplete or partial form, and much of the data is conflicting. Collecting all
chemicals referenced is not a data quality activity and perhaps even does a disservice to the chemical
community by implying factual information that may not exist.
2 GROWTH AND QUALITY OF FACTUAL DATABASES
The growth in non-bibliographic factual databases in chemistry has been considerable over the past
few years in the number and types of databases, but not in the actual quality
of the data. The reason is quite simple. Scientific data requires careful and professional examination
and evaluation before it should be allowed to be put into a database. If there
are any gross and obvious errors, these will usually stand out like a sore thumb to knowledgeable
professionals, but it is not always clear that people are aware of all the data values for a given property,
or of what can be done with conflicting data. In an interesting paper entitled
"Critical Data for Critical Needs", Lide (1) presents a marvelous diagram of some 200 reported measurements
of the thermal conductivity of copper. Plotted together, the recorded data amount to
a scatter plot of semi-random numbers. The evaluation process is then shown to provide what
is believed to be the correct value.
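To make the idea of reconciling conflicting values concrete, the following is a minimal sketch in Python. It is not the evaluation procedure described by Lide (1), which rests on expert judgment of each measurement. It assumes only a simple statistical screen (the median and the median absolute deviation), and the function name, the cutoff, and the sample numbers are all invented for illustration.

```python
from statistics import median

def recommend_value(measurements, cutoff=3.0):
    """Return (recommended value, retained values, rejected values).

    Assumption: a purely statistical screen. Values lying more than
    `cutoff` scaled MADs from the median are set aside as outliers,
    and the median of the remaining values is recommended.
    """
    if not measurements:
        raise ValueError("no measurements supplied")
    center = median(measurements)
    mad = median(abs(x - center) for x in measurements)
    if mad == 0:
        # At least half of the values coincide; this screen rejects nothing.
        return center, list(measurements), []
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    retained = [x for x in measurements if abs(x - center) <= cutoff * scale]
    rejected = [x for x in measurements if abs(x - center) > cutoff * scale]
    return median(retained), retained, rejected

# Invented numbers, not taken from any real compilation.
reported = [398.0, 401.0, 400.0, 399.0, 402.0, 250.0, 520.0]
value, kept, dropped = recommend_value(reported)
```

A screen of this kind only flags values that disagree with the majority. It cannot tell whether the majority itself is wrong, which is why professional evaluation remains necessary.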
What is particularly disturbing is the lack of published evaluation criteria for scientific data. When
someone says a database has been evaluated, where are the written evaluation criteria? Who
developed the criteria? How are the criteria implemented? Are the numeric data values
experimental or calculated? If calculated, are they interpolated or extrapolated, and by what process?
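One way to picture these questions is as fields that each stored value could carry with it. The sketch below, in Python, is a hypothetical record layout and not the format of any database discussed in this session; the class names, field names, and the helper open_questions are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Origin(Enum):
    EXPERIMENTAL = "experimental"
    INTERPOLATED = "interpolated"   # calculated, inside the measured range
    EXTRAPOLATED = "extrapolated"   # calculated, outside the measured range

@dataclass(frozen=True)
class DataValue:
    property_name: str
    value: float
    unit: str
    origin: Origin
    method: str = ""        # process used to obtain a calculated value
    criteria_ref: str = ""  # pointer to the written evaluation criteria

def open_questions(record: DataValue) -> list[str]:
    """List the questions from the text that this record leaves unanswered."""
    questions = []
    if record.origin is not Origin.EXPERIMENTAL and not record.method:
        questions.append("by what process was the value calculated?")
    if not record.criteria_ref:
        questions.append("where are the written evaluation criteria?")
    return questions

# Hypothetical entry with no stated method and no criteria reference.
entry = DataValue("boiling point", 100.0, "degC", Origin.EXTRAPOLATED)
missing = open_questions(entry)
```

A record that answers every question returns an empty list, which gives a user a quick check on how much of the evaluation history has actually been recorded.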
The presentation of non-bibliographic factual databases in this chemistry session is a recognition of
their coming of age. These databases have, as I am sure the authors will describe, taken a great deal
of time and money, and most importantly, intellectual effort to create. The actual creation of
the databases, the data management side, is but a small part of the overall effort involved. A look
at the complex, detailed, and time-consuming data evaluation, quality control, error checking, and data
verification required will quickly make one realize the considerable problems these database producers face,
as compared to the problems encountered by their colleagues working on bibliographic databases.
3 EXAMPLES OF FACTUAL DATABASES
The four chemistry databases to be discussed are Beilstein - Handbook of Organic Chemistry,
HEILBRON - The Dictionary of Organic Chemicals (DOC - 5th Edition), KIRK-OTHMER
Encyclopedia of Chemical Technology, and Martindale - The Extra Pharmacopoeia (2). These
databases vary considerably in size. KIRK-OTHMER has 1200 articles (containing over 6000 tables
and 5000 figures). Martindale has information on over 5100 drugs available in the UK. HEILBRON
DOC has chemical and physical data and key literature references for some 185,000 chemicals,
and the Beilstein database gives critically examined properties, and other information, for over
three million chemicals.
There are many, many other non-bibliographic factual databases in chemistry and related scientific
areas. A recent review (3) by Rumble and Lide indicated over 50 such databases, a figure which they admit
is "most likely an underestimate by two thirds." A second useful source of further
information on numeric databases is the Drexel Library Quarterly issue on numeric databases (4). Two useful
sources for locating non-bibliographic factual databases are the Cuadra/Elsevier Directory of Online
Databases (5) and the Knowledge Industry Publications' Database Directory (2). One should
remember when using either of these two sources that they rely on the accuracy of the database
provider/producer. Thus one should carefully check exactly what is in the database, actual
availability (as sometimes the existence of a database comes after the press release), actual update
cycles (as sometimes databases for technical and/or economic reasons are not updated with the
frequency announced), and so forth. The four chosen for presentation at the International Online '86
Information Conference are meant to be representative of what is available, and hopefully represent
databases of considerable appeal. Many non-bibliographic chemistry databases consist of only
a few hundred records, any one of which is of great value to a narrow user community, so
one should not be surprised to discover that there is only a small amount of data
in some of these databases. The database producer can only provide what is there,
and this is often very little compared with the enormous numbers associated with bibliographic
databases.
4 ACCESS TO FACTUAL DATABASES IN CHEMISTRY
Most of the access to factual databases in chemistry is provided by a cottage industry. Small but
valuable chemical and chemical-related databases exist on many computer systems. There is no one
system which provides a supermarket or broad collection of numeric databases. As most of the
systems on which one can find numeric chemical databases are local (that is, not of the DIALOG
or SDC size, scope, and marketing capabilities), learning about them and gaining access is not a
simple matter. However, once found, almost all of these systems are available via computer
networks, so actual access is relatively easy. But having to learn a different online search
system for each type of numeric, factual data does not seem to have encouraged the use of these
databases. While there are services which have been developed to access bibliographic information
from a number of vendors, this is not possible at present for numeric, factual databases, due to
the specialized nature of their search software. However, with the development of expert system front end
interfaces, I would expect considerable improvements in this area within a few years.
5 SOFTWARE FOR FACTUAL DATABASES
As for the search software used to obtain the information from these databases, the existing
bibliographic software (DIALOG, SDC ORBIT, Data-Star, BRS, CAS ONLINE, and so forth) is
acceptable for some of the databases. Other factual chemical databases, some of which are not
included in the presentations at this session, do require different and generally more
complex search procedures. It has been reported (6), for example, that the CAS ONLINE/STN
bibliographic search software Messenger "cannot handle the numeric data" of Beilstein. As factual
chemical databases grow, in both number and size of entries in each database, the necessary search
software will follow. However, at present, most of the time one can only print out (but not
search) the numeric and factual data in these databases.
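The difference between printing and searching numeric data can be sketched briefly. The Python example below is a hypothetical illustration and does not describe how Messenger or any other named system works. It assumes that values are stored as text, as a term-matching system would hold them, and all record names and numbers are invented.

```python
# Invented records; the melting points are stored as text, the way a
# term-matching retrieval system would hold them.
records = [
    {"name": "compound A", "melting_point_c": "80.5"},
    {"name": "compound B", "melting_point_c": "122.4"},
    {"name": "compound C", "melting_point_c": "79"},
]

def text_search(records, term):
    """Match the search term against the stored text exactly."""
    return [r["name"] for r in records if r["melting_point_c"] == term]

def range_search(records, low, high):
    """Interpret the stored text as a number and test it against a range."""
    return [r["name"] for r in records
            if low <= float(r["melting_point_c"]) <= high]

text_hits = text_search(records, "80")          # finds nothing
range_hits = range_search(records, 78.0, 82.0)  # finds compounds A and C
```

A question such as "which compounds melt near 80 degrees" can only be answered by the second kind of search, since every stored value can be printed by either system.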
6 SUMMARY
The continued growth of databases in this area, which has been predicted to be ready for the fast
track "any day", will be slow, but certainly steady, as users of online services demand more than the
bibliographic, referral databases can provide. While costs for non-bibliographic databases are
generally higher than those of their larger bibliographic counterparts, the additional cost is usually
well worth it. That is because one gets the desired information directly in the office or lab, rather than
having to go to a book or journal in the library to get what is wanted.
7 REFERENCES
1. D. R. Lide, Jr., Critical Data for Critical Needs, Science, 212, 1343-1349 (1981).
2. Database Directory, Knowledge Industry Publications, Inc., 701 Westchester Ave., White Plains, NY 10604.
3. J. R. Rumble, Jr. and D. R. Lide, Jr., Chemical and Spectral Databases: A Look into the Future, J. Chem. Inf. Comput. Sci., 25, 231-235 (1985).
4. Numeric Databases, Drexel Library Quarterly, 19, #3&4, Summer-Fall (1982).
5. Directory of Online Databases, Cuadra Associates, 2001 Wilshire Blvd., Suite 305, Santa Monica, CA 90403.
6. Monitor, #63, page 6, May 1986.