Online Chemical Information

Stephen R. Heller
USDA, ARS, ASRI, BARC-W
Bldg. 007, Room 056
Beltsville, MD 20705 USA




ABSTRACT

A brief summary of the chemical information available online in computer systems around the world is presented and discussed. Both databases and software systems are included.



INTRODUCTION

IUPAC has recently created a new Committee on Chemical Databases, with members from the USA, USSR, FRG, and Japan. The Committee has three main terms of reference. The first is to advise the IUPAC President and Executive Committee on all aspects of computerized databases of chemical properties, on needs for standardization of databases and chemical structure records, and on policy for database dissemination. The second is to work with the IUPAC Commissions on the design and implementation of databases and appropriate software, and to encourage maximum compatibility of databases from different groups within IUPAC. The third is to promote, in collaboration with other ICSU bodies, a higher level of awareness of the application of computers in the management, dissemination, and use of chemical data. As part of this third activity, this presentation on Online Chemical Information has been prepared as the first educational activity of the Committee. The Committee is also working on a Glossary of Computer Terms for chemists, and is preparing a draft list of chemical codes for IUPAC (including Greek letters and other special symbols), which would be the computer equivalent of the IUPAC Green Book of symbols and standards. With the heightened awareness of the need for and use of computers as a tool to aid the chemist, it is felt that the Committee will have a growing and interested audience throughout IUPAC for its activities.

The chemical literature and chemical information were once a finite and manageable resource. However, with the information explosion and the growth of the scientific literature, this resource has become almost infinite and unmanageable, making it very difficult for a chemist to keep abreast of the latest developments in a given area of research. This problem has been recognized by the scientific information community, most notably Chemical Abstracts, and this has led over the past twenty years to a considerable amount of automation of chemical information and data.

There are many books (1) and articles on the subject of chemical information and computerized databases, so this brief article will only survey the field, to ensure that the reader is aware of what is available and of the major features and characteristics of the products and services on the market. One critical point, which it is hoped will be the impetus for the reader to learn more about this subject, is that the existence and availability of online information makes it possible for anyone, anywhere in the world, in both big and small organizations, to have the same access to information at the same time and at (essentially) the same cost.

TYPES OF CHEMICAL INFORMATION

There are a number of types of chemical information which have been automated or computerized, and it is important to know and understand the differences between these types. The first is reference data, often called bibliographic chemical data. This type includes Chemical Abstracts, which publishes some 400,000 abstracts per year, as well as the Institute for Scientific Information (ISI) Index Chemicus, which also publishes a similar number of abstracts per year. These databases contain textual information, citations, sometimes abstracts, but not factual data.

The second type of chemical database is called non-bibliographic, factual, source or numeric data. These databases contain actual numbers or measurements, like mass spectral data, infrared data, boiling points, partition coefficient values, and so forth. Handbooks and similar types of databases, such as Beilstein - The Handbook of Organic Chemistry, Heilbron - The Dictionary of Organic Chemicals, The Cambridge Crystal Database, The Merck Index, and so forth are examples of databases which would fit into the category of source or non-bibliographic databases.

In the field of chemistry there is a third type of database which is related to, and a possible companion database to, these two types, which is called a chemical structure database. This is simply a database in which the chemical structure has been represented in computer readable form, usually called a connection table. An example of a connection table is given in Figure 1, which shows the usual chemical structure, with a molecular formula C8H7ClO, and, below the structural diagram, the computer representation of the chemical. Each atom is numbered and then identified with a letter(s) representing which element it is, followed by the other atoms to which it is connected. Lastly, the type of bond connection is given. Bond type 9 means aromatic, type 5 means a chain single bond, and type 1 is a ring single bond. In the past (and also continuing into the present) there have been many other representations of chemical structures, including nomenclature, such as IUPAC names, and Wiswesser Line Notation (WLN). What differentiates a connection table from these linear notations (i.e., notations which can be written on one line) is the two-dimensional nature of the connection table as well as the ability to search for chemical fragments or sub-structures in a completely open and total manner. (It is worthwhile to note that structure searching by name fragments and WLN is possible, but it is not as good, efficient, or complete as connection table searching.)
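The connection table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the molecule used is chlorobenzene, chosen for brevity, and is not the C8H7ClO compound of Figure 1; the names ConnectionTable and find_fragment are hypothetical and do not correspond to any vendor's software; hydrogens are left implicit. Only the bond type codes (9, 5, and 1) are taken from the text. The fragment search is a simple backtracking match, shown to make the idea of sub-structure searching concrete, not as a description of how any online system performs it.

```python
from dataclasses import dataclass, field

# Bond-type codes as given in the text.
AROMATIC = 9
CHAIN_SINGLE = 5
RING_SINGLE = 1


@dataclass
class ConnectionTable:
    """Hypothetical connection table: numbered atoms plus typed bonds."""
    elements: dict = field(default_factory=dict)  # atom number -> element symbol
    bonds: dict = field(default_factory=dict)     # frozenset({a, b}) -> bond type

    def add_atom(self, number, element):
        self.elements[number] = element

    def add_bond(self, a, b, bond_type):
        self.bonds[frozenset((a, b))] = bond_type

    def neighbours(self, atom):
        """Yield (neighbour atom number, bond type) pairs for one atom."""
        for pair, bond_type in self.bonds.items():
            if atom in pair:
                (other,) = pair - {atom}
                yield other, bond_type


def find_fragment(table, fragment):
    """Return every mapping of fragment atoms onto table atoms."""
    frag_atoms = sorted(fragment.elements)
    matches = []

    def extend(mapping):
        if len(mapping) == len(frag_atoms):
            matches.append(dict(mapping))
            return
        f_atom = frag_atoms[len(mapping)]
        for t_atom, element in table.elements.items():
            if element != fragment.elements[f_atom] or t_atom in mapping.values():
                continue
            # Each fragment bond to an already-mapped atom must exist in the
            # table with the same bond type.
            consistent = all(
                table.bonds.get(frozenset((t_atom, mapping[f_nbr]))) == f_type
                for f_nbr, f_type in fragment.neighbours(f_atom)
                if f_nbr in mapping
            )
            if consistent:
                mapping[f_atom] = t_atom
                extend(mapping)
                del mapping[f_atom]

    extend({})
    return matches


# Illustrative molecule (an assumption, not Figure 1): chlorobenzene.
chlorobenzene = ConnectionTable()
for n in range(1, 7):
    chlorobenzene.add_atom(n, "C")
chlorobenzene.add_atom(7, "Cl")
for n in range(1, 7):
    chlorobenzene.add_bond(n, n % 6 + 1, AROMATIC)  # six-membered aromatic ring
chlorobenzene.add_bond(1, 7, CHAIN_SINGLE)          # chlorine on ring carbon 1

# Query fragment: an aromatic carbon bearing a chlorine.
fragment = ConnectionTable()
fragment.add_atom(1, "C")
fragment.add_atom(2, "C")
fragment.add_atom(3, "Cl")
fragment.add_bond(1, 2, AROMATIC)
fragment.add_bond(1, 3, CHAIN_SINGLE)

matches = find_fragment(chlorobenzene, fragment)
for match in matches:
    print(match)
```

Each printed mapping pairs a fragment atom number with an atom number in the full structure. Because the query is expressed as atoms and typed bonds, any fragment can be posed in the same way, which is the openness that name-fragment and WLN searching lack.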



CHEMICAL DATABASES

This short article cannot go into detail about the many chemical and chemical related databases, so it will be necessary for the reader to refer to the database producer, to the online system on which a particular database is available, or to a directory of databases for the details desired. In 1980 there were about 500 computer readable databases available in all fields of science, technology, business, and other areas, with some 75 companies making these databases available online in computer systems which could be accessed by telephone and computer terminal connection. By 1986 this number had grown to over 2900 databases available from about 450 different sources.

In many cases the same database is available from more than one source. For example, CA Search (which has a total of over five million abstracts in computer readable form), the online computer version of Chemical Abstracts, is available from some nine online vendors throughout the world, and is updated every 2-4 weeks, depending on which online vendor you choose. The Merck Index is available from three different companies. The 13-CNMR database is available from two companies, while the LogP database is available from only one company. The definition of a chemical database is generally quite broad, and includes the usual bibliographic databases, many patent databases, chemical property databases, and chemical structure/nomenclature databases. An excellent source of information for the latest summary of chemical (and other) databases is the Directory of Online Databases (2), published quarterly, and usually available in the library.

In the field of bibliographic databases, the most widely used is the Chemical Abstracts database, which goes back to 1967, adds over 400,000 citations per year, and totals over 7,000,000 citations. The ISI database, Index Chemicus, which covers the literature from 1962 to date, includes only new chemicals, and thus is smaller (4,000,000) than the CA database.

In addition to these two large abstracting services in chemistry, the American Chemical Society (ACS) has computerized nineteen of its journal publications, so that now the entire journal article (less the tables and diagrams) is in computer readable form and can be searched (3).

In the area of non-bibliographic, factual, or numeric databases, there are many, and the list continues to grow. One important point to be made about these databases, as opposed to the bibliographic databases, is their size. The numeric databases are usually very small in numbers of chemicals. Some, like CESARS (a database of detailed and evaluated toxicological data), have information on about 200 chemicals. The 13-CNMR databases range in size from 15,000 to 50,000. The mass spectral databases range from 40,000 to over 100,000. In the case of mass spectral data, the larger database does not have as complete information on each chemical as the smaller database, so one must be careful to examine quality as well as quantity.

Thermodynamic databases are available from producers like DECHEMA (FRG), the Thermodynamics Research Center (Texas A&M University), and the Physical Properties Data System (PPDS) of the UK. The IUPAC Committee on Chemical Databases is working with the IUPAC Commissions to make databases such as conductance, solubility, transport properties, and enthalpy of vaporization available, both on computer tapes and floppy disks as well as through online vendors in the near future.



ONLINE COMPUTER SYSTEMS

Each type of database mentioned above requires computer software to search, retrieve, and/or analyze the information in the particular database. Such software needs a computer to run on. Thus the online computer systems (all of which are in the USA, except as noted), such as DIALOG, ORBIT, BRS, Data-Star (Switzerland), JICST (Japan), DARC (France), STN (USA and FRG), CAS ONLINE, Pergamon Online (UK), TDS, CIS, and so forth, are a combination of a database, software, and computer hardware which together form a complete system (2). Generally speaking, these online computer system vendors do not create and own the databases which are available on their systems. Thus, the approximately 200 databases on the DIALOG system are maintained and owned by their respective creators, not DIALOG. DIALOG is simply a supplier of the information from others. Some, like CAS, have primarily their own databases on their system (STN), but most of their databases are also found on other systems (such as DIALOG, BRS, QUESTEL, ORBIT, Pergamon Online, and others).

In addition to the above systems which search bibliographic and non-bibliographic databases, there are two major online systems which search for chemical structures. They are CAS ONLINE and the QUESTEL DARC system. Both have the entire CAS file of over seven and one half million chemical structures, and the QUESTEL DARC system also has the ISI database of over three and one half million chemical structures which are associated with the ISI Index Chemicus database.

All of these systems are available, usually via a local dial-up telephone, in most countries, and usually at nominal prices. Figure 2 shows the author sitting at a computer terminal connected to a chemical database system via an ordinary telephone line, which can be seen in the bottom left-hand corner of the figure. The computer systems of some vendors are available 24 hours per day, 7 days per week. The remaining systems are available 5-6 days per week, usually for about 20 hours per day. Thus it is fair to say chemical information is available essentially anywhere and at any time. As telecommunications become easier and less expensive, usage will increase. In many countries and organizations, access is available through a library or similar information group, often at little or no cost to the end user. For a novice it is probably better to let someone else perform the searching at first, so you can learn how it is done.



SUMMARY

It has been the intent of this article to provide a brief overview of computer based chemical information in terms of the databases and how they are available to the worldwide scientific community. Understanding what is available and where such information and chemical data can be found is becoming more and more important in the high technology world we live in today. IUPAC is participating in this area through its new Committee on Chemical Databases, and all IUPAC members are encouraged to provide input to this Committee to help create and make available for dissemination and distribution the valuable data being gathered and evaluated by the IUPAC Commissions.



REFERENCES

1. See, for example, Y. Wolman, "Chemical Information - A Practical Guide to Utilization", John Wiley & Sons, New York (1983), and J. Ash, et al., "Communication, Storage and Retrieval of Chemical Information", Ellis Horwood, Chichester (1985).

2. For an explanation of all the acronyms, what databases these online vendor companies have available, and how these companies can be contacted, an excellent reference source is: Directory of Online Databases, Cuadra Associates, 2001 Wilshire Blvd., Suite 305, Santa Monica, CA 90403 USA.

3. S. W. Terant, L. R. Garson, B. E. Myers, and S. M. Cohen, "Online Searching: Full Text of American Chemical Society Primary Journals", J. Chem. Inf. Comput. Sci., 24, 230 (1984).