Computerized Spectroscopy Databases

Stephen R. Heller


(This is the fourth in a series of articles aiming to promote a higher awareness of the computer applications in the management, dissemination, and uses of chemical data. It describes a number of computerized spectroscopy databases).

This article is part of the efforts of CCDB to familiarize the IUPAC membership with various aspects of computer activities in chemistry. Previous articles have described online databases (1), the Beilstein and Gmelin databases (2), and chemical structure searching (3). This article will concentrate on the spectroscopy databases listed in Table 1. Crystallographic databases will not be covered here for reasons of space. For further details about these databases (Powder Diffraction, NIST Single Crystal Data, Organic X-ray Crystal Data, and Inorganic X-ray Crystal Data), please refer to the sources and related information (4-8).

TABLE 1 - Spectral databases

Mass Spectrometry
Infrared
Raman
1H NMR
11B NMR
13C NMR
15N NMR
17O NMR
19F NMR
31P NMR
ESR

Computer-based spectral databases are among the oldest and most widely used computerized products in chemistry. This article will give an overview of the field, as well as highlight some of the more interesting activities and sources of spectral data. For more comprehensive articles on this subject the reader is referred elsewhere (9,10). This article will cover only computer-readable databases which are available on magnetic tape, floppy disks, or CD-ROM. No mention will be made of the many spectral collections available in printed form.

The reason for the long-term and widespread use of computerized spectral databases stems from their practical value in analytical and organic chemistry labs, as well as from the ease with which these data could be made available, retrievable, and searchable. While the value of such databases seems clear, the need has not been met to the satisfaction of many (11, 12, 13). Isenhour (12) wrote of the frustration that he and his colleagues have with the lack of large, representative, high quality spectral databases which would enable further research in search and interpretation studies. Furthermore, Shelly (13) has written "Although many spectral databases have been created, few are of high quality and many are useless". The reasons for this are the errors in the databases, incomplete data, and lack of structural information associated with a spectrum, all of which inhibit further useful work in the areas of concern to Isenhour. As we will see later in the paper, there has been some progress in the five years since these comments were made. As these databases have continued to develop, the emphasis has moved from the quantity of data to the quality of the data. That is, the largest collections are not necessarily the best, nor the most widely used. This point will be further discussed later in this article.



Infrared Data

Infrared data is the oldest type of spectral data which has been available to the chemical community. The original printed collections of data, such as those from Sadtler, were from prism and grating instruments, which have lower resolution than the modern FT-IR instruments. As Lias has noted (9), many spectroscopists believe that such older data are not adequate for good reference databases. An example of the new generation of IR data is the Aldrich-Nicolet Digital FT-IR database of condensed phase (liquid) spectra and the condensed phase Sigma-Aldrich Biochemical Library. Each contains about 12,000 spectra. In addition there is a vapor phase Aldrich FT-IR database of 5,000 spectra. There is also a collection of about 61,000 condensed phase spectra available from Sadtler. Sadtler also sells a vapor phase library of some 5,000 spectra. There is a smaller collection of 3,000 FT-IR spectra which EPA commissioned in the 1980s for environmental analysis studies. More recently NIST has been undertaking experimental work in order to expand this collection and provide good reference data. NCLI (National Chemical Laboratory for Industry in Tsukuba, Japan (14)) has a database of about 26,000 unique IR spectra, which grows to 60,000 spectra when multiple spectra of the same chemical are included. Chemical Concepts also has a database of some 28,000 IR spectra obtained from BASF. These are full FT-IR spectra, with connection tables.



Raman

The NCLI (14) has a database of about 3600 Raman spectra, without connection tables.



Mass Spectrometry

Mass spectral data, primarily owing to its importance in environmental analysis, has become the most widely used tool for chemical substance identification. The regulatory power of the US EPA has been the driving force behind this activity and the development of what is now called the NIST/EPA/MSDC mass spectral database (15). As there has been a recent discussion (16) regarding the matter of mass spectral data quality, it is instructive to comment here about this issue. Over the past three decades a number of mass spectral databases have been developed in the US, UK, Germany, and Japan. At present the two largest of these have evolved with very different philosophies in how they are being built. A third database, with associated search software, of about 30,000 high quality mass spectra from a number of Max-Planck Institutes in Germany, as well as from universities, is also available from Chemical Concepts (17).

The John Wiley collection (18) attempts to collect any available spectra with no published acceptance criteria. This database lacks the structural information which Shelly has commented on (13) and does not contain any connection tables. Thus one often (some 17,000 out of almost 140,000 spectra) finds only 2-5 peaks in a spectrum. To compensate for the lack of data the developer of the Wiley database computer-generates isotopic abundance peaks. On the other hand the NIST collection of some 54,000 spectra, virtually all of which have connection tables, consists primarily of reference spectra for chemical analysis, many of which have been obtained by the US Government by running the mass spectra in the laboratory, thus ensuring both high quality and complete spectra. The importance of this point is highlighted by the fact that of the spectra unique to the Wiley database almost 42% (some 58,000 spectra) have fewer than 10 peaks, whereas this is true of only 3.5% of the NIST spectra (16). The NIST scientists also found that a spectrum unique to the NIST database contains nearly 5 times as many peaks as a spectrum unique to the Wiley database. In addition to the data quality study by NIST, the Max Planck Institute in Mulheim, Germany is conducting its own data quality analysis of mass spectral databases and has found similar results. In summary, the issue of data quality is one that is not easily discovered, and potential users of these spectral databases should carefully examine the content and description of each database when considering which one will meet their needs.
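The peak-count comparison above suggests a simple screening step a prospective user could apply to any library before adopting it. The following is only a minimal sketch, not the procedure used in the NIST study (16): the function name `sparse_fraction`, the threshold default, and the toy library are all hypothetical, and the peak values are invented for illustration.

```python
from typing import Dict, List, Tuple

# A spectrum is represented here as a list of (m/z, relative intensity) pairs.
Spectrum = List[Tuple[int, float]]


def sparse_fraction(library: Dict[str, Spectrum], min_peaks: int = 10) -> float:
    """Return the fraction of spectra that have fewer than min_peaks peaks."""
    if not library:
        return 0.0
    sparse = sum(1 for peaks in library.values() if len(peaks) < min_peaks)
    return sparse / len(library)


# Hypothetical toy library; names and peak values are invented.
toy_library: Dict[str, Spectrum] = {
    "compound_a": [(41, 20.0), (43, 100.0), (58, 35.0)],   # 3 peaks
    "compound_b": [(mz, 10.0) for mz in range(30, 45)],    # 15 peaks
    "compound_c": [(77, 100.0), (105, 60.0)],              # 2 peaks
    "compound_d": [(mz, 5.0) for mz in range(50, 62)],     # 12 peaks
}

fraction = sparse_fraction(toy_library)
print(f"{fraction:.0%} of spectra have fewer than 10 peaks")
```

A count of this kind says nothing about whether the peaks that are present are correct; it only flags spectra that are likely to be incomplete.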

NMR databases

While there are publicly available spectral databases for seven different nuclei (primarily 1H NMR, 11B NMR, 13C NMR, 15N NMR, 17O NMR, 19F NMR, and 31P NMR), the discussion here will focus on the 13C NMR databases, as this data is presently regarded as the most useful.



13C NMR

The premier collection, with over 100,000 13C NMR spectra, is the so-called Bremser database, named after the original developer at BASF (19). This database has been developed at BASF over the past two decades and has been evaluated and checked carefully by the BASF scientists. The database also contains chemical structures, and the chemical shifts have been assigned. Recently the CAS Registry Number has been added to the database entries, which has further enhanced its value and usefulness. There is also a database of about 30,000 spectra and corresponding connection tables available from Sadtler (20).

Recently the German government has decided to sponsor a company, Chemical Concepts (17), to take the Bremser BASF database, together with spectral data from a wide variety of other sources and associated search and analysis software, and to make the entire compilation available to the chemical community. This database, which continues to grow at a substantial rate, now contains about 100,000 spectra, all of which have connection tables in which each carbon atom is assigned a particular chemical shift.
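To make concrete what "a connection table in which each carbon atom is assigned a particular chemical shift" can look like, here is a minimal sketch of such a record. It is not the actual storage format of the Bremser or Chemical Concepts databases; the class names, fields, and the heavy-atom-only convention are assumptions made for illustration, and the ethanol shift values are approximate.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Atom:
    element: str
    shift_ppm: Optional[float] = None  # assigned 13C shift; None if unassigned


@dataclass
class AssignedStructure:
    name: str
    atoms: List[Atom]
    bonds: List[Tuple[int, int, int]]  # (atom index, atom index, bond order)

    def carbon_shifts(self) -> Dict[int, float]:
        """Map each assigned carbon's atom index to its chemical shift."""
        return {
            i: atom.shift_ppm
            for i, atom in enumerate(self.atoms)
            if atom.element == "C" and atom.shift_ppm is not None
        }

    def unassigned_carbons(self) -> List[int]:
        """Return indices of carbon atoms that carry no shift assignment."""
        return [
            i
            for i, atom in enumerate(self.atoms)
            if atom.element == "C" and atom.shift_ppm is None
        ]


# Ethanol with hydrogens omitted; shift values are approximate, for illustration.
ethanol = AssignedStructure(
    name="ethanol",
    atoms=[Atom("C", 18.0), Atom("C", 58.0), Atom("O")],
    bonds=[(0, 1, 1), (1, 2, 1)],
)

print(ethanol.carbon_shifts())
print(ethanol.unassigned_carbons())
```

Keeping the shift on the atom record, rather than in a separate peak list, is what allows a search to go from a substructure to its expected shifts and back.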



1H NMR

There are four major collections of computer-based 1H or proton NMR data, obtained mostly at a frequency of 60 or 90 MHz. These are Chemical Concepts (about 13,000 spectra), Sasaki/Japan (10,000 coded spectra and 4,000 digitized full spectra (21)), NCLI (about 6,500 spectra) (14), and the Institute of Organic Chemistry, Novosibirsk, USSR (about 50,000 spectra (22)).





11B NMR

Chemical Concepts (17) distributes a database of about 9000 spectra of 11B NMR data.

15N NMR

Chemical Concepts (17) distributes a database of about 1000 spectra of 15N NMR data.

17O NMR

Chemical Concepts (17) distributes a database of about 900 spectra of 17O NMR data.

19F NMR

Fraser-Williams (23) distributes a PC searchable database of about 10,000 spectra of 19F NMR data. The database is data and text searchable. There is also a 19F NMR database of about 2000 spectra available from Chemical Concepts (17).



31P NMR

Chemical Concepts (17) distributes a small database of about 2200 spectra of 31P NMR data.



ESR

NCLI (14) has created a database of about 1,300 ESR spectra. It has not yet been made publicly available.



Integrated Systems

The BASF system, parts of which have been described earlier in this article, is the one which Chemical Concepts has been refining and improving so that it can be a viable and easily used product for the scientific community. It is called SpecInfo. The unique feature of SpecInfo, as opposed to most other single spectral database systems, is the integration of its software for a total solution to a problem using spectral identity and similarity, spectral interpretation, spectral simulation, spectral calculations, and structure determination. This is best illustrated in Figure 1, which is a schematic of how data from a number of different spectroscopies can be integrated into a solution of a typical lab problem. Figure 2 shows the status of the SpecInfo spectral databases.
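As a rough illustration of what a spectral similarity search involves, the sketch below ranks library spectra against a query using a cosine (normalized dot-product) score. This is a generic, commonly used measure chosen for the example; it is not claimed to be the algorithm used in SpecInfo or in any database described here. The function names and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# A spectrum as a mapping from peak position (e.g. m/z) to intensity.
Peaks = Dict[int, float]


def cosine_similarity(a: Peaks, b: Peaks) -> float:
    """Return the cosine of the angle between two spectra (0.0 to 1.0)."""
    dot = sum(intensity * b.get(pos, 0.0) for pos, intensity in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_matches(
    query: Peaks, library: Dict[str, Peaks], top_n: int = 3
) -> List[Tuple[str, float]]:
    """Score every library spectrum against the query, best match first."""
    scored = [(name, cosine_similarity(query, peaks)) for name, peaks in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical reference library and query; values invented for illustration.
reference_library: Dict[str, Peaks] = {
    "ref_1": {43: 100.0, 58: 30.0},
    "ref_2": {77: 100.0, 105: 50.0},
}
query_spectrum: Peaks = {43: 100.0, 58: 30.0}

for name, score in rank_matches(query_spectrum, reference_library):
    print(f"{name}: {score:.3f}")
```

An identity search corresponds to a score at or near 1.0; a similarity search accepts lower scores and returns structurally related candidates, which is why the completeness of the reference spectra matters so much.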

There are two important features of SpecInfo which should be noted here. The first is the ability to add your own spectra to the database so that an individual or laboratory can make use of the powerful and varied software analysis programs in SpecInfo. The second is the policy of Chemical Concepts to work with universities, providing them with a no-cost copy of the system in return for spectral contributions.

Summary

In summary, this article has described a number of computer-readable databases in the area of spectroscopy, as well as one integrated spectral system. The size and quality of these databases vary considerably, as does their cost. Before choosing any of these databases, one should understand well the problems which need to be solved, so that the database obtained will indeed provide the proper solution.

References

1. S. R. Heller, "Online Chemical Information", Chem. Int., 9, 136-138(1987).

2. S. R. Heller, "Computer Databases of the Beilstein and Gmelin Institutes", Chem. Int., 11, 49-52(1989).

3. S. R. Heller and D. E. Meyer, "Chemical Substructure Search Software for Personal Computers", Chem. Int., 12, #3, 89-94 (1990).

4. International Centre for Diffraction Data, 1601 Park Lane, Swarthmore, PA 19081 USA. The cost of the complete set on CD-ROM is $ 1,250 for those who already subscribe to the printed product.

5. National Institute of Standards and Technology, Office of Standard Reference Data, Building 221, Room A-325, Gaithersburg, MD 20899 USA. The cost of the database is $1000.

6. Cambridge Crystallographic Data Centre, Lensfield Road, Cambridge CB2 1EW, UK. The cost of the database is difficult to specify owing to the unique pricing and access policy of this data center.

7. Inorganic Crystal Data Center, University of Bonn, Institute of Inorganic Chemistry, Gerhard-Domagk Strasse 1, D-5300 Bonn 1, Germany. This database is available online via the STN and CAN/SND (Canada) networks.

8. A discussion of the above four databases as well as related databases and software systems can be found in "Crystallographic Databases", published by the Data Commission of the International Union of Crystallography, Chester, UK (1987).

9. S. G. Lias, "Numeric Databases for Chemical Analysis", J. Res. of the NIST, 94, 25-35 (1989)

10. W. A. Warr, "Spectral Databases", Chemometrics and Intelligent Lab. Sys., in press, 1991.

11. S. R. Heller and R. Potenzone Jr, "Computer Readable Analytical Chemical Data - Comments on a Critical Need", Trends in Anal. Chem., 2, 218-221 (1983)

12. T. Isenhour, "Spectroscopic Databases", J. Chem. Inf. Comput. Sci., 26, 2A (1986).

13. C. Shelly, "Problems That Prevent Computer-Assisted Structure Elucidation From Becoming a Practical Tool", pages 6-25, in Computer-Supported Spectroscopic Databases, Ed. J. Zupan, Ellis Horwood, UK (1986).

14. Dr. K. Tanabe, National Chemical Laboratory for Industry, 1- 1 Higashi, Tsukuba, Ibaraki 305, Japan

15. The database is available either on high density floppy disks or CD-ROM. The system program was written by Dr. Stephen E. Stein, National Institute of Standards and Technology, Office of Standard Reference Data, Building 221, Room A-325, Gaithersburg, MD 20899 USA. The database on floppy disks is available from NIST, OSRD, Physics Building, Room A323, Gaithersburg, MD 20899 for $1050.00. The same database on a CD-ROM is available (catalog # Z21,399-3) from Aldrich Chemical Company, 1001 West Saint Paul Avenue, Milwaukee, WI 53233 USA for $1050.00.

16. S. E. Stein, P. Ausloos, and S. G. Lias, "Comparative Evaluations of Mass Spectral Databases", J. Amer. Soc. Mass Spectrom., 2, in press, (1991).

17. Chemical Concepts, Boschstrasse 12, PO Box 10 02 02, D-6940 Weinheim, Germany.

18. John Wiley & Sons, 605 Third Avenue, New York, NY 10158 USA.

19. W. Bremser, "Structure Elucidation and Artificial Intelligence", Angew. Chem., 100, 252-65 (1988).

20. Bio-Rad, Sadtler Division, 3316 Spring Garden Street, Philadelphia, PA 19104 USA.

21. Prof. S. Sasaki, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku, Toyohashi 441, Japan

22. Dr. B. Derendjaev, Institute of Organic Chemistry, Prospect Lavrentiev 9, Siberian Division of the USSR Academy of Sciences, 630090 Novosibirsk-90, USSR

23. Fraser-Williams Ltd, London House, London Road South, Poynton, Cheshire, SK2 1NJ, UK.