Computer Activities at the Beilstein Institute

Stephen R. Heller
Research Leader
USDA, ARS, Beltsville, MD 20705



As a recent American Biotechnology Laboratory editorial (1) indicated, after over 150 years of printed inorganic (Gmelin) and organic (Beilstein) handbooks, there are considerable computer activities taking place in Germany that will have a significant effect on scientists throughout the world. While the editorial gave a very good, but brief description of the activities of the Gmelin Institute, this article will describe the activities of the institute with which Gmelin shares a building in Frankfurt, West Germany: The Beilstein Institute.

While Gmelin is a decade older than the Beilstein Institute, the Beilstein group has taken the lead in computerization activities. As might be expected, while the Gmelin Handbook presents information on more than 250,000 inorganic compounds, the Beilstein Handbook of Organic Chemistry, comprising over 300 volumes, contains factual data on over three million organic chemicals.

The Beilstein Handbook of Organic Chemistry is the premier printed collection of important published data on the preparation and properties of carbon compounds. It is, by far, the largest collection of evaluated scientific data in the field of chemistry. The Beilstein Handbook is produced by the nonprofit Beilstein Institute and distributed by the German publisher Springer-Verlag. The Beilstein Institute editors and staff critically sift and correlate the data from the literature and point out errors in the published data, providing the user with much more than just extracted results of scientific publications. Chemical Abstracts, which is purely bibliographic in nature, performs no quality control or examination of results and, in recent years, has primarily used authors' abstracts directly without any review. The current Beilstein Handbook consists of over 300 printed volumes, covering the literature from 1830 to 1979. The original work or Basic Series (Hauptwerk in German) covered the literature from 1830 to 1909, the latter year being just about the time Chemical Abstracts started its bibliographic abstracting and indexing service. A number of Supplementary Series (Ergänzungswerk in German) have since been published. The first, E I, runs from 1910 to 1919, E II from 1920 to 1929, E III from 1930 to 1949, E IV from 1950 to 1959, and lastly, E V from 1960 to 1979. In addition, there are some 27 Cumulative Indexes. Until the start of the 5th Series, the entire handbook was written in German; from that point on it has been written in English, to take into account the movement of the scientific community from German to English as the primary language of scientific communication.
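The series coverage described above amounts to a simple lookup from publication year to series. The following Python fragment is only an illustrative sketch of that lookup; the names SERIES and series_for_year are hypothetical and are not part of SANDRA or of any Beilstein software.

```python
# Literature coverage of each Beilstein Handbook series, as given in the text.
# "H" is the Basic Series (Hauptwerk); "E I" to "E V" are the Supplementary Series.
SERIES = [
    ("H", 1830, 1909),
    ("E I", 1910, 1919),
    ("E II", 1920, 1929),
    ("E III", 1930, 1949),
    ("E IV", 1950, 1959),
    ("E V", 1960, 1979),
]


def series_for_year(year):
    """Return the series whose literature period includes `year`, or None."""
    for label, first, last in SERIES:
        if first <= year <= last:
            return label
    return None  # outside the 1830-1979 coverage of the printed Handbook
```

A paper published in 1935, for example, falls in the period covered by E III, while anything from 1980 onward is outside the printed series listed here.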

SANDRA

With this background, it is easy to understand both the need of chemists for the Beilstein Handbook and the great value offered by a computer program that is able to quickly and accurately indicate where, among the 300 volumes, data and information can be found for a given organic chemical. The Beilstein Handbook is well organized, but the structure of the ordering system is difficult to learn, and easily forgotten. Thus the SANDRA program is a tool every chemist dealing with properties of organic chemicals should have available for use with a PC.

SANDRA (available from Springer-Verlag Publishers, New York, New York), the acronym for Structure and Reference Analyzer, is an IBM PC DOS-based program (a Macintosh version is not expected) that takes a chemical structure the user draws on the PC screen and indicates where in the 300 volumes of the printed Beilstein Handbook the referenced compound can be found. The program, written at the Beilstein Institute, is a major advance in the tools which the Institute has created to help the user locate a chemical in the Beilstein Handbook. The highly structured Beilstein ordering and indexing system has always been a handicap to chemists trying to use the 300-odd volumes of the Beilstein Handbook. Now, with one very well designed and easy-to-use program, it is possible for organic chemists and even non-chemists to find references in the Beilstein Handbook easily.

SANDRA is easy to install, and the manual gives extensive instructions on how to use the program. It takes only a few minutes to load the program on a hard disk and start it running. The program requires 256K memory, DOS 2.0 or higher, a Microsoft or equivalent mouse, and an IBM CGA or equivalent graphics adapter board with a resolution of 640 x 200 pixels. Version 1.0 of the program will not run with a Hercules graphics board, but a later version, due to be released by the end of 1987, will work with the Hercules board. The flexible graphical structure input is easy to learn. If a command is forgotten, the user need only touch the mouse pointer to the HELP COMMANDS box, and the list of commands is immediately displayed on the screen. There are 30 predefined structure templates, and the user's own templates can be created and stored.


Figure 1. Result of entering the amino-hydroxy aromatic compound, C11.H17.N.O, into the SANDRA program.


Figure 2. Final output of the SANDRA program analysis for the amino-hydroxy aromatic compound with pointer information and reference information.


A sample structure, C11.H17.N.O, shown after input, is illustrated in Figure 1. Atoms can be labeled and numbered for easier identification. The analysis is performed after an acceptable structure is entered, by simply typing "Q" to quit the structure entry and "2" to start the analysis part of the program. The analysis normally takes from 2 to 6 sec, depending on the computer used (IBM XT or AT, or Compaq 386) and the complexity and size of the structure entered. The author was able to draw a 70-atom molecule, which is the maximum number of atoms allowed by the program. The correct pointers and page numbers for the molecule were supplied in 12 seconds.

Figure 2 shows the output of the program after the analysis is finished. The examples shown in Figures 1 and 2 are direct screen dumps of what one sees on the screen, and were created using the IBM PC DOS print screen keyboard function. The output information in Figure 2 shows the value of the SANDRA program. A number of pointers are shown in Figure 2. The first is the H-page number (in this case 574-624), followed by the Beilstein System number (in this case 1855). The degree of unsaturation (2n-6) and the carbon number (in this case 7, which turns out not to be the same as the number of carbons in the molecule) are further guides to finding the exact page in the Beilstein Handbook where this compound can be found. The information in the bottom left-hand corner of Figure 2 shows the other Supplementary volumes (E IV, 13/3 and E III, 13/3) where more recent information on this chemical can be found. The lower right-hand corner contains the molecular formula of the molecule.

Computer readable files

The second area of computer activity at the Beilstein Institute is the creation of two databases, a Structure File and a Factual File. The Structure File will consist, eventually, of about three million chemical structures, and will provide a complete topological structure representation for each chemical. This file will consist of Beilstein Registry Connection Tables (BRCT), the largest collection of complete structure representations ever compiled. The BRCT will contain stereochemical information on organic molecules that the Chemical Abstracts database (of over eight million chemicals) does not contain. The BRCT will contain a number of fields; details of the structure record can be found elsewhere (2). The entire structure file on computer tape will be available for lease in 1988 with a number of existing structure search software systems, such as the French DARC system, the Hungarian HTSS software, and the Molecular Design MACCS software.

The Beilstein Factual File will contain over 7.5 million factual records for organic compounds dating back to 1830. More than 400 fields of information exist in the Factual File, with more than 60 of these being numeric data fields. The database will contain all the physical and chemical properties relevant to the compounds in the database. Each property will have a literature citation. The entire Factual File will be available in 1988, but only on-line. At present, the Beilstein Institute does not plan to lease the Factual File. The database will be available first on the Lockheed DIALOG system, and later on the STN Network. The delay in the latter version is due to the lack of appropriate software to perform numeric data searching. It is probable that additional on-line vendors will make the Factual File available at a later date. The decision as to what software DIALOG will use for structure searching has not yet been finalized; however, the data searching software is expected to be an enhanced version of the current DIALOG search software.

The factual database itself will consist of two parts: evaluated data and non-evaluated data. All data from the 1830s through 1979 (corresponding to the printed Beilstein Handbook H, E-I, E-II, E-III, E-III/IV, E-IV, and E-V Series) will be critically evaluated before going on-line in the DIALOG system. From 1980 onward, the data will first be put on-line in a non-evaluated form, to be replaced as the data are critically evaluated by the Beilstein scientists. This will update the Beilstein database. A further distinction will be made in the critically evaluated data that will be available. For the evaluated data, there will be two types of compounds, the Large Information Compounds (LIC) and the Small Information Compounds (SIC). LICs, which comprise just a few percent of the entire database (probably less than 5% of the total chemicals in the database), are chemicals which are very important in the fields of chemistry, biochemistry, pharmaceuticals, and agrochemistry. For these LICs, the on-line Beilstein database on DIALOG will have only a subset of all the information available for the compound. However, the Beilstein Handbook will continue to have all the information for each LIC. For the SICs, all the information in the Beilstein Handbook will be available in the DIALOG on-line version.

The search capabilities of the system on DIALOG will allow for the type of searches described by Andersen in his editorial (1). Thus, one will be able to enter a melting point (or boiling point) and a second property, such as a density of 0.8852, and quickly get a list of all chemicals that meet these criteria, along with the relevant literature citations. If there are still too many compounds which satisfy these criteria, then a third criterion could be added to further narrow down the search to a reasonable number of possible answers.
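The idea of narrowing a search by adding property criteria one at a time can be sketched in a few lines of Python. This is only an illustration of the principle, not the DIALOG software: the Record fields, the search function, and the four sample records (with placeholder names, values, and citations) are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str               # placeholder compound name
    melting_point_c: float  # melting point, degrees Celsius
    density: float          # density, g/cm3
    citation: str           # placeholder literature citation


def search(records, **criteria):
    """Return records whose named numeric fields all lie within (low, high)."""
    return [
        r for r in records
        if all(low <= getattr(r, field) <= high
               for field, (low, high) in criteria.items())
    ]


# Invented records, for illustration only.
RECORDS = [
    Record("compound A", 5.5, 0.8852, "ref A"),
    Record("compound B", 5.0, 0.9100, "ref B"),
    Record("compound C", 6.0, 0.8850, "ref C"),
    Record("compound D", 80.0, 0.8852, "ref D"),
]

# One criterion: a melting point window.
hits = search(RECORDS, melting_point_c=(4.0, 7.0))

# Two criteria: the same window plus a density near 0.8852.
narrowed = search(RECORDS, melting_point_c=(4.0, 7.0),
                  density=(0.8845, 0.8855))

for r in narrowed:
    print(r.name, r.citation)
```

Each added criterion can only shrink the hit list, which is why a third property would be added if two still leave too many candidates. Ranges are used rather than exact values because reported measurements of the same property differ slightly from one publication to another.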

While the cost to the user for the DIALOG and STN on-line versions has yet to be established, the Beilstein Institute and Springer-Verlag have stated that those who do subscribe and continue to subscribe to the printed Handbook editions will receive a discount on their on-line use.

Summary

With the activities of the Max Planck Society, Gmelin Institute, and the Beilstein Institute, the chemical community will be provided with a vast treasure of high quality, evaluated chemical and physical property data on millions of organic and inorganic chemicals in computer-readable form. This will enable scientists to easily obtain, manipulate, and correlate data in a manner never before possible.



References

1. ANDERSEN, H.C., Am. Biotech. Lab. 5 (1), 4-6 (1987).

2. JOCHUM, C., WITTIG, G., and WELFORD, S., "Search possibilities depend on the data structure: The Beilstein facts," in Proceedings of the 10th International Online Meeting (London), December 1986 (Learned Information, Medford, New Jersey, 1986), 43-52.