A Survey of Reaction Databases

Stephen R. Heller

US Department of Agriculture

Agricultural Research Service

BARC-W, Bldg. 011A, Room 164

Beltsville, MD 20705-2350 USA

Phone: 1-301-344-1709, FAX: 1-301-344-1823

Telemail: SRHELLER, BITNET: SRHELLER@UMDARS

Keywords: Chemical Reactions, Chemical Reaction Retrieval Systems, ORAC, REACCS, CASREACT, ChemInform, Beilstein

Abstract: This paper presents a brief survey of chemical reaction databases, their contents, and the corresponding search and retrieval software.



1 INTRODUCTION

Over the past few years there has been considerable growth in the number and types of databases of chemical reactions. This growth has been stimulated by the availability of two commercial chemical reaction retrieval software systems: Organic Reactions Accessed by Computer (ORAC) from ORAC Ltd. (1) and REaction ACCess System (REACCS) from MDL Ltd. (2). This paper introduces the reaction databases symposium papers presented at the 1989 London Online Conference and is the first summary of the various databases to be compiled for publication.

Chemical synthesis is a fundamental activity in chemistry, particularly organic chemistry. A chemist who needs to synthesize a chemical needs to know what methods are available to convert chemical A into chemical B quickly, efficiently, with as high a yield as possible, and at the lowest possible cost. To find these methods, the chemist needs to perform a substructure search on a database of chemical reactions. The ORAC and REACCS in-house software systems, along with the online CAS STN system, perform such searching. There are differences in the software capabilities of these systems, but this paper does not cover that topic; the reader is referred elsewhere (3).



2 AVAILABLE DATABASES



The following is a list of the available chemical reaction libraries discussed in this article. Additional details on these databases will be found in the papers following this article. The FIZ-Chemie ChemInform and Beilstein databases are not included here, as they were not commercially available when this paper was being prepared in mid-1989.



Database                                     Number of    Years of
                                             Reactions    Coverage

A. CASREACT                                   570,000     1985 - present
B. ORAC Core database                          50,000     1900 - present *
C. ORAC - Theilheimer - Synthetic
   Methods of Organic Chemistry                47,000     1946 - 1980
D. ORAC Academic Collaboration                  5,000     1987 - present
E. ORAC Heterocyclic                           15,000     1980 - present
F. REACCS - Theilheimer - Synthetic
   Methods of Organic Chemistry                47,000     1946 - 1980
G. REACCS - Derwent's Journal of
   Synthetic Methods                           29,000     1980 - present
H. REACCS - Organic Synthesis                   5,000     1921 - present
I. REACCS - Current Literature File (CLF)      25,000     1983 - present
J. REACCS - CHIRAS - Asymmetric Synthesis       5,000     1975 - present

(* Most of the reactions in the ORAC Core database are from 1980 onwards, but important reactions dating back to the turn of the century are also included.)

Neglecting redundancy, the overall numbers for the reaction databases available on CAS, ORAC, and REACCS are, respectively, about 570,000, 120,000, and 110,000. As might be expected, these "raw" numbers have little meaning for most chemists. The main reasons are that the ORAC and REACCS databases have only about 100,000 unique reactions, and that the large CAS database is limited to the recent literature (1985 onward) and carries limited information (e.g., no stereochemistry).
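The rounded totals quoted above can be checked by summing the entries in the table. The following is only an arithmetic sketch in Python; the dictionary layout, the shortened file names, and the function name are illustrative assumptions and are not part of any of the systems discussed.

```python
# Reaction counts copied from the table in Section 2.
# The grouping by vendor and the shortened file names are illustrative only.
DATABASE_SIZES = {
    "CAS": {"CASREACT": 570_000},
    "ORAC": {
        "Core database": 50_000,
        "Theilheimer": 47_000,
        "Academic Collaboration": 5_000,
        "Heterocyclic": 15_000,
    },
    "REACCS": {
        "Theilheimer": 47_000,
        "Derwent JSM": 29_000,
        "Organic Synthesis": 5_000,
        "Current Literature File": 25_000,
        "CHIRAS": 5_000,
    },
}


def raw_totals(sizes):
    """Sum the reaction counts per vendor, ignoring overlap between files."""
    return {vendor: sum(files.values()) for vendor, files in sizes.items()}


totals = raw_totals(DATABASE_SIZES)
for vendor, total in totals.items():
    print(f"{vendor}: {total:,}")
```

The ORAC and REACCS sums come to 117,000 and 111,000, which round to the figures of about 120,000 and 110,000 given in the text. Because Theilheimer appears in both the ORAC and REACCS lists, these sums overstate the number of distinct reactions.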

3 DATABASE CONTENT

CASREACT

The databases listed in the previous section vary considerably in their content. The CASREACT database (4), which was initiated a few years ago, has reactions from the chemical literature going back only to 1985. It contains about 570,000 single-step reactions found in some 39,000 records. As many of the very important and most frequently used chemical reactions go back decades, if not longer, there are clear limitations to the CAS database. While the CAS database does include "all" reactions, as opposed to just the "important" or "interesting" reactions which appear in the databases from ORAC and REACCS, it is easy to be swamped by its volume. CAS decided that quantity would be their main focus, with little or no concern about the chemistry or content of the database. This is consistent with the bibliographic nature of their abstract service. The CASREACT database is derived from chemical reactions found in over 100 important synthetic organic chemistry journals.

The data elements or parameters searchable in the CASREACT database include the CAS Registry Number, starting materials, products, catalyst, reagents, and solvents. Missing from the database are parameters such as stereochemistry, reaction temperature, comments on the reaction (such as mechanism information), and labeling of reaction centers (which atoms in the molecules were involved in the reaction). A number of parameters can be displayed, but are not searchable. These include bibliographic information, in-depth substance and subject indexing, and abstracts.

ORAC and REACCS Databases

Both of these systems contain the Theilheimer database, and their databases are created in essentially the same manner. Figures 1-4 illustrate how a reaction is entered into a database, using an example from ORAC. The bottom right hand corner of Figure 1 shows the entire reaction being entered, with the product shown in the main drawing area. Figure 2 shows how one maps the atom-to-atom correspondences between reactant and product. The asterisks tag the atoms of concern and the numbers show the details of the correspondences. Figure 3 shows the data entry form for additional information about the reaction; its fields are self-evident from the labels in the figure. Figure 4 shows part of the final version of an entry.
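The purpose of the atom-to-atom mapping step can be illustrated with a small sketch in Python. Once each reactant atom is paired with its counterpart in the product, the bonds that change during the reaction follow from a set difference. This is not the internal representation used by ORAC or REACCS; the function name, the atom labels, and the toy substitution reaction (a C-Br bond replaced by a C-O bond) are all assumptions made for illustration.

```python
def changed_bonds(reactant_bonds, product_bonds, atom_map):
    """Return (broken, formed) bonds, both expressed in product atom labels.

    reactant_bonds / product_bonds: iterables of 2-tuples of atom labels.
    atom_map: dict mapping each reactant atom label to its product label.
    """
    # Translate every reactant bond into the product's atom labels.
    mapped = {frozenset(atom_map[a] for a in bond) for bond in reactant_bonds}
    product = {frozenset(bond) for bond in product_bonds}
    return mapped - product, product - mapped


# Hypothetical example: carbon loses its bond to Br and gains a bond to O.
atom_map = {"rC": "pC", "rBr": "pBr", "rO": "pO"}
reactant_bonds = [("rC", "rBr")]
product_bonds = [("pC", "pO")]

broken, formed = changed_bonds(reactant_bonds, product_bonds, atom_map)

# The reaction center is every atom that takes part in a changed bond.
reaction_center = set().union(*broken, *formed)

print("broken:", [sorted(b) for b in broken])
print("formed:", [sorted(b) for b in formed])
print("reaction center:", sorted(reaction_center))
```

Storing the correspondences explicitly is what allows a retrieval system to answer queries about which bonds change, rather than only about which structures appear on each side of the reaction.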

The Theilheimer - Synthetic Methods of Organic Chemistry database is derived from volumes 1-35 of the printed editions, covering chemical reactions published from 1946 to 1980. There have been no updates to Theilheimer since 1980.

The ORAC core database comes from literature searches performed by the ORAC staff.

The ORAC Academic Collaboration database comes from reactions submitted by university collaborators who have access to the ORAC software in exchange for providing reactions to the database.

The ORAC Heterocyclic database comes from literature abstracting of heterocyclic reactions.

The Derwent database is the computer readable version of the 1980 - 1987 printed publication entitled the Journal of Synthetic Methods (JSM), published monthly by Derwent. JSM includes patent coverage. JSM can be thought of, in some ways, as picking up where Theilheimer stopped in 1980.

The Organic Synthesis database is the computer readable version of the printed Organic Synthesis reference collection, which dates back to 1921 and currently runs through Volume 67 (1987). Organic Synthesis is a collection of well tested, verified methods for the preparation of specific compounds. There are about 100 new reactions added each year to the database.

The Current Literature File (CLF) is a database originally created by an MDL REACCS customer and now being added to by contributions from a number of sources. The reactions come from about 35 journals abstracted since 1983.

The CHIRAS database of asymmetric synthesis contains synthetic routes for optically active materials used primarily in the agrochemical and pharmaceutical industries. The reactions cover the literature from 1975 to the present. CHIRAS was initially developed by scientists at Hoffmann-La Roche labs in the USA.

The data elements or parameters searchable in the ORAC and REACCS databases include the starting materials, products, catalyst, reagents, solvents, stereochemistry, bonds which change during the reaction, author, journal, year of publication, name of reaction, and temperature. The ORAC version also has comments about the reaction (such as "Product steam distilled immediately on completion of reaction."), reaction keywords (such as metal amide, migration, and rearrangement), and available physical data (such as melting point, boiling point, and refractive index). Only the REACCS CHIRAS database has the CAS Registry Number included as a parameter.
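To make the idea of searchable parameters concrete, the following Python sketch models a reaction entry as a record whose fields mirror some of the data elements listed above, together with a simple field-based filter. It is a hypothetical illustration only: the class, the field names, the search function, and the sample entries are assumptions, and neither ORAC nor REACCS is implied to store or search data in this way. Structure-based searching (substructures, stereochemistry, changed bonds) is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReactionRecord:
    """Hypothetical record; field names follow the parameters named in the text."""
    starting_materials: List[str]
    products: List[str]
    reagents: List[str] = field(default_factory=list)
    solvents: List[str] = field(default_factory=list)
    catalyst: Optional[str] = None
    author: Optional[str] = None
    journal: Optional[str] = None
    year: Optional[int] = None
    reaction_name: Optional[str] = None
    temperature_c: Optional[float] = None
    keywords: List[str] = field(default_factory=list)
    comments: str = ""


def search(records, keyword=None, solvent=None, year_from=None):
    """Return the records that satisfy every criterion that was supplied."""
    hits = []
    for rec in records:
        if keyword is not None and keyword not in rec.keywords:
            continue
        if solvent is not None and solvent not in rec.solvents:
            continue
        if year_from is not None and (rec.year is None or rec.year < year_from):
            continue
        hits.append(rec)
    return hits


# Made-up sample entries using placeholder compound names.
records = [
    ReactionRecord(
        ["chemical A"], ["chemical B"],
        solvents=["ether"], year=1982,
        keywords=["rearrangement", "migration"],
        comments="Product steam distilled immediately on completion of reaction.",
    ),
    ReactionRecord(
        ["chemical C"], ["chemical D"],
        solvents=["water"], year=1975,
        keywords=["metal amide"],
    ),
]

hits = search(records, keyword="rearrangement", year_from=1980)
for rec in hits:
    print(rec.starting_materials, "->", rec.products, "|", rec.comments)
```

A parameter that a system stores but does not index, such as the abstracts in CASREACT, would correspond here to a field that can be printed from a hit but that `search` offers no argument for.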



4 SUMMARY

This article has presented the reader with an overview of the different chemical reaction library databases and their contents. In addition to these databases, others are being developed by ORAC Ltd. and MDL Ltd., as well as by the German government's FIZ-Chemie in Berlin and the Beilstein Institute. As these last two groups have not yet released any products, no discussion of their databases has been included here. In any event, these two databases would likely be made available both to the in-house systems (ORAC and REACCS) and to the online STN CASREACT system, with essentially the same file content as the databases discussed in this article.



5 REFERENCES

1. ORAC Limited, 175 Woodhouse Lane, Leeds LS2 3AR, United Kingdom, (Telephone: 0532-441821, FAX: 0532-448283).

2. Molecular Design Limited, 2132 Farallon Drive, San Leandro, CA 94577, (Telephone: 415-895-1313 or 800-635-0064, FAX: 415-352-2870).

3. Borkent, J. H., Oukes, F., and Noordik, J. H., "Chemical Reaction Searching Compared in REACCS, SYNLIB, and ORAC", J. Chem. Inf. Comput. Sci., 28, 148-150 (1988).

4. CASREACT is available only online from CAS STN, 2540 Olentangy River Road, PO Box 3012, Columbus, OH 43210, (Telephone 614-447-3600 or 800-848-6538, ext. 3731).