FACTUAL DATABASES IN CHEMISTRY: AN INTRODUCTORY OVERVIEW



Stephen R. Heller
Model and Database Coordination Laboratory,
Agricultural Systems Research Institute
USDA, ARS, BARC-W, Bldg. 007, Room 56
Beltsville, MD 20705 USA




Abstract: The growth in non-bibliographic factual databases in the field of chemistry and related sciences is an important phenomenon to both the information scientist and the working chemist. This presentation, and the four following talks in the chemistry session of the International Online '86 Information Conference, will provide both an overview of the field and some specific details on how such databases are created, evaluated, maintained, and used by the community.

1 INTRODUCTION

As chemists and physical scientists all know, Chemical Abstracts is one of the largest and most widely used bibliographic databases in the world today. It is large in size and broad in coverage. It contains a great deal of information, but little factual data. Furthermore, it contains abstracts which report whatever the authors choose to say. Thus these abstracts rarely contain evaluations or quality control, but just report, in an objective manner, what is presented. (Of course, one can argue that author abstracts, now being used more often by abstracting services, are not as objective as those written by an independent abstractor.) What is contained in such bibliographic databases is scientific quantity, not scientific quality. These abstract entries, which often have errors, are normally not corrected, not even in the online versions.

In contrast, non-bibliographic or factual databases contain data, numbers, and perhaps even some knowledge. While there may be over seven million chemicals reported in the scientific literature, there are data on only a portion of these. Consider that the Beilstein database, which dates back about 150 years and does contain data on organic chemicals, has about half the number of compounds found in the Chemical Abstracts Registry database. Since it is presently not possible to compare the Beilstein and Chemical Abstracts structures, there is no way to examine the question of what chemicals may be missing from each database. However, it is known that Chemical Abstracts does have chemicals in its Registry System which do not exist, and hence there could be no data for these chemicals. (For example, if an author reports energy calculations indicating that a particular compound cannot be made, the compound, because it is discussed in the paper, is assigned a CAS Registry Number, and thus becomes a material which has been "reported in the literature". While this is true from an abstracting and librarian point of view, it is not clear that scientists would consider this "reported in the literature".) Furthermore, for most of these chemicals, the data reported are often in incomplete or partial form, and much of the data is conflicting. Collecting all chemicals referenced is not a data quality activity and perhaps even does a disservice to the chemical community by implying factual information that may not exist.

2 GROWTH AND QUALITY OF FACTUAL DATABASES

The growth in non-bibliographic factual databases in chemistry has been considerable over the past few years in quantity and in the types of databases, but not in the actual quality of the data. The reason is quite simple. Scientific data require careful and professional examination and evaluation before they are put into a database. Gross and obvious errors will usually stand out like a sore thumb to knowledgeable professionals, but it is not always clear that people are aware of all the data values for a given property, or of what can be done with conflicting data. An interesting paper by Lide (1), entitled "Critical Data for Critical Needs", contains a marvelous diagram of some 200 reported measurements of the thermal conductivity of copper; the totality of the recorded data forms a scatter plot of semi-random numbers. The evaluation process is then shown to provide what is believed to be the correct value.

What is particularly disturbing is the lack of published evaluation criteria for scientific data. When someone says a database has been evaluated, where are the written evaluation criteria? Who developed the criteria? How are the criteria implemented? Are the numeric data values experimental or calculated? If calculated, are they interpolated or extrapolated, and by what process?
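These questions can be read as a checklist of information that ought to accompany every numeric value in an evaluated database. As a minimal sketch only, the following Python fragment shows one way such a record might carry its own provenance. The class, the field names, and the placeholder values are all hypothetical; they are not taken from Beilstein, HEILBRON, or any other database discussed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    """How a numeric value was obtained (hypothetical categories)."""
    EXPERIMENTAL = "experimental"
    INTERPOLATED = "calculated (interpolated)"
    EXTRAPOLATED = "calculated (extrapolated)"


@dataclass(frozen=True)
class DataValue:
    """One property value together with its evaluation provenance."""
    property_name: str
    value: float
    unit: str
    origin: Origin
    method: Optional[str] = None        # process used for a calculated value
    criteria_ref: Optional[str] = None  # where the written criteria can be found
    evaluator: Optional[str] = None     # who developed or applied the criteria

    def unanswered_questions(self) -> List[str]:
        """List the provenance questions this record leaves open."""
        gaps = []
        if self.criteria_ref is None:
            gaps.append("no written evaluation criteria cited")
        if self.evaluator is None:
            gaps.append("evaluator not identified")
        if self.origin is not Origin.EXPERIMENTAL and self.method is None:
            gaps.append("calculation method not stated")
        return gaps


# Placeholder record: the value 0.0 is not a real measurement.
record = DataValue("example property", 0.0, "arbitrary unit", Origin.EXTRAPOLATED)
for gap in record.unanswered_questions():
    print(gap)
```

A record that cannot answer these questions is not necessarily wrong, but a user has no basis for judging it.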

The presentation of non-bibliographic factual databases in this chemistry session is a recognition of their coming of age. These databases have, as I am sure the authors will describe, taken a great deal of time and money, and most importantly, intellectual effort to create. The actual creation of the databases, the data management side, is but a small part of the overall effort involved. The complex, detailed, and time-consuming work of data evaluation, quality control, error checking, and data verification will quickly make one realize the considerable problems these database producers face, as compared to the problems encountered by their colleagues working on bibliographic databases.

3 EXAMPLES OF FACTUAL DATABASES

The four chemistry databases to be discussed are Beilstein - Handbook of Organic Chemistry, HEILBRON - The Dictionary of Organic Chemicals (DOC - 5th Edition), KIRK-OTHMER Encyclopedia of Chemical Technology, and Martindale - The Extra Pharmacopoeia (2). These databases vary considerably in size. KIRK-OTHMER has 1200 articles (containing over 6000 tables and 5000 figures). Martindale has information on over 5100 drugs available in the UK. HEILBRON DOC has chemical and physical data and key literature references for some 185,000 chemicals, and the Beilstein database gives critically examined properties, together with other information, for over three million chemicals.

There are many other non-bibliographic factual databases in chemistry and related scientific areas. A recent review (3) by Rumble and Lide indicated over 50 such databases, which they admit is "most likely an underestimate by two thirds." A second useful source to examine for further information on numeric databases is the Drexel Library issue on numeric databases (4). Two useful sources for locating non-bibliographic factual databases are the Cuadra/Elsevier Directory of Online Databases (5) and the Knowledge Industry Publications' Database Directory (2). One should remember when using either of these two sources that they rely on the accuracy of the database provider/producer. Thus one should carefully check exactly what is in the database, actual availability (as sometimes the existence of a database comes after the press release), actual update cycles (as sometimes databases for technical and/or economic reasons are not updated with the frequency announced), and so forth. The four chosen for presentation at the International Online '86 Information Conference are meant to be representative of what is available and, it is hoped, represent databases of considerable appeal. Many non-bibliographic chemistry databases consist of only a few hundred records, any one of which is of great value to a narrow user community, so one should not be surprised to discover that some of these databases hold only a small amount of data. The database producer can only provide what is there, and this is often very little compared with the enormous numbers associated with bibliographic databases.

4 ACCESS TO FACTUAL DATABASES IN CHEMISTRY

Most of the access to factual databases in chemistry is provided by a cottage industry. Small but valuable chemical and chemical-related databases exist on many computer systems. There is no one system which provides a supermarket or broad collection of numeric databases. As most of the systems on which one can find numeric chemical databases are local (that is, not of the DIALOG or SDC size, scope, and marketing capabilities), learning about them and gaining access is not a simple matter. However, once found, almost all of these systems are available via computer networks, so actual access is relatively easy. But having to learn a different online search system for each type of numeric, factual data does not seem to have encouraged the use of these databases. While services have been developed to access bibliographic information from a number of vendors, the specialized nature of the search software for numeric, factual databases means that this is not possible for them at present. However, with the development of expert system front end interfaces, I would expect considerable improvements in this area within a few years.

5 SOFTWARE FOR FACTUAL DATABASES

As for the search software used to obtain the information from these databases, the existing bibliographic software (DIALOG, SDC ORBIT, Data-Star, BRS, CAS ONLINE, and so forth) is acceptable for some of the databases. Other factual chemical databases, some of which are not included in the presentations at this session, do require different and generally more complex search procedures. It has been reported (6), for example, that the CAS ONLINE/STN bibliographic search software Messenger "cannot handle the numeric data" of Beilstein. As factual chemical databases grow, in both number and size of entries in each database, the necessary search software will follow. However, at present, most of the time one can only print out (but not search) the numeric and factual data in these databases.
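The distinction between printing and searching numeric data can be illustrated with a small sketch. It does not describe how Messenger or any of the systems named above actually works; the records, field name, and values are invented placeholders. Text-oriented matching finds a value only when the query string is stored verbatim, whereas a numeric search must parse each stored value and compare it against a range.

```python
from typing import Dict, List

# Placeholder records; identifiers and values are invented for illustration.
RECORDS: List[Dict[str, str]] = [
    {"id": "A", "melting_point_C": "121.5"},
    {"id": "B", "melting_point_C": "122"},
    {"id": "C", "melting_point_C": "1.22e2"},
    {"id": "D", "melting_point_C": "98.0"},
]


def text_search(records: List[Dict[str, str]], field: str, query: str) -> List[str]:
    """Bibliographic-style search: match the stored string exactly."""
    return [r["id"] for r in records if r[field] == query]


def range_search(records: List[Dict[str, str]], field: str,
                 low: float, high: float) -> List[str]:
    """Numeric search: parse each stored value and test it against a range."""
    return [r["id"] for r in records if low <= float(r[field]) <= high]


print(text_search(RECORDS, "melting_point_C", "122"))           # ['B']
print(range_search(RECORDS, "melting_point_C", 121.0, 123.0))   # ['A', 'B', 'C']
```

The text search misses records A and C even though both lie within a degree of the query, which is the kind of gap that specialized numeric search software has to close.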

6 SUMMARY

The continued growth of databases in this area, which has been predicted to be ready for the fast track "any day", will be slow but certainly steady, as users of online services demand more than the bibliographic, referral databases can provide. While costs for non-bibliographic databases are generally higher than those for their larger bibliographic counterparts, the additional cost is usually well worth it. That is because one gets the desired information directly in the office or lab, rather than having to go to a book or journal in the library to get what is wanted.

7 REFERENCES

1. D. R. Lide, Jr., Critical Data for Critical Needs, Science, 212, 1343-1349 (1981).

2. Database Directory, Knowledge Industry Publications, Inc., 701 Westchester Ave., White Plains, NY 10604.

3. J. R. Rumble, Jr. and D. R. Lide, Jr., Chemical and Spectral Databases: A Look into the Future, J. Chem. Inf. Comput. Sci., 25, 231-235 (1985).

4. Numeric Databases, Drexel Library Quarterly, 19, #3&4, Summer-Fall (1982).

5. Directory of Online Databases, Cuadra Associates, 2001 Wilshire Blvd., Suite 305, Santa Monica, CA 90403.

6. Monitor, #63, page 6, May 1986.