The Status of Infrared Data Bases



Cherie L. Fisk and G.W.A. Milne, National Institutes of Health, Bethesda, Maryland 20205

and



Stephen R. Heller, Environmental Protection Agency, Washington, DC 20460





Abstract



Computerized data bases of IR spectra are among the oldest spectroscopic data bases to have been made available and used by the scientific community. Their use has not, however, lived up to expectations or been as extensive as that of other related tools, such as mass spectral data bases. The purpose of this paper is to describe some of the computer search systems that work with these collections of spectral data. The history of the major collections of condensed IR spectra will be outlined, and more recent work directed towards the compilation of collections of condensed-phase and gas-phase spectra, with or without the use of Fourier transform techniques, will also be summarized.



Introduction



Infrared spectroscopy predates most other forms of spectroscopy as a useful tool for the analytical organic chemist in deducing the details of molecular structure. The technique has been most successfully applied for compound identification by the use of well established empirical correlations for specific functional groups and fingerprint patterns in particular spectral regions. A trained chemist, armed with a good collection of published IR spectra, can be a formidable opponent in any structure elucidation problem.



Although infrared spectral analysis does indeed depend on a good spectral library, creation of large high quality computerized IR data banks has lagged behind other such efforts, most notably in X-ray powder diffraction and mass spectrometry. The reasons for this are both technical and logistical. Infrared absorption is characterized by band-shape in addition to frequency, and thus a fully digitized spectrum rather than a simple series of line positions and intensities (as in mass spectrometry) should be used for matching. Digitized IR spectra have only recently become common as a result of the application of Fourier transform (FT) techniques. Thus, utilization of existing large spectral compendia is limited by the errors and costs associated with current techniques for after-the-fact digitization. The historical impetus for hard copy spectra has also raised difficult logistical questions. Published IR data bases have often been developed as commercial ventures. Therefore, questions of cooperation and pooling of spectra may be complicated by formalities not existing in informal scientific collaboration. This paper will outline the nature and status of existing IR data bases in order to highlight the problems that must be overcome to achieve standardized computerized data banks that will allow mutual exchange to everyone's benefit. Many of these topics were recently outlined at the 1979 Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy in a "Discussion of Computer Data Bases for Indexing, Storage and Retrieval of Infrared Spectra" sponsored by the Joint Committee on Atomic and Molecular Physical Data (JCAMP). A preliminary report of this meeting has appeared (1) and the final version is currently being prepared for distribution to participants (2).



Existing Data Bases



Condensed Phase Spectra



Infrared spectroscopy, like some of its sister spectroscopies, suffers from a computer data bank situation that is a consequence of a cottage industry approach used in the past. There are a number of small data banks of differing quality, along with one rather large data bank.



The large data bank, The American Society for Testing and Materials (ASTM) IR band index, contains about 145,000 spectra compiled between the early 1950s and 1974. This file is no longer being updated and is distributed by Sadtler Research Laboratories (3) as the Sadtler/ASTM IR spectral index on magnetic tape. The basic collection of 102,000 entries (AMD 33A) was increased in size in 1974 by an additional 43,000 entries with a final, 15th, supplement (AMD 33A-S15). This file contains numerous duplications of spectra, owing to the diverse sources of the spectra. The subfiles include:



(A) American Petroleum Institute Project 44

(B) Sadtler Research Laboratories Spectra

(C) National Research Council--National Bureau of Standards Atlas

(D) Literature spectra abstracted by ASTM-sponsored groups

(E) Documentation of Molecular Spectroscopy

(F) Coblentz Society Spectra

(G) Manufacturing Chemists Association Spectra

(H) Infrared Data Committee of Japan Spectra



Because the ASTM computer-readable file was designed in the 1950s, it is based on an 80-column punch card format. The data are therefore limited to compound serial numbers (in lieu of names), IR peak positions, and whatever chemical and structural data could be encoded. Also, since microns were the units of line position in use 25 years ago, the data are encoded only to the nearest 0.1 micron, and a peak (band) falling between two 0.1 micron intervals was arbitrarily coded at either the lower or the higher interval. As a result, there are numerous inconsistencies in the data. Furthermore, specific intensity data are missing owing to the lack of space on the punch card. In spite of these shortcomings, the ASTM file does provide useful results in many cases and has been used by many laboratories over the years. Searches of this data base may be accomplished with several systems (e.g., SPIR, IRGO, IRIS) described in a following section. Each of these search systems also allows for addition of a user's own spectral collection in the ASTM format. Spectral matching uses major and minor peak positions and "no-peak" areas. The output of a search is a list of serial numbers of the most probable matches; the user must then refer to the printed ASTM "IR Serial Number List--AMD 32" (or its supplement AMD 32-S15) for the compound names. The lack of really extensive use of this data base, however, is no doubt due in part to the limitations imposed by the coding and the need for the printed reference materials.
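The kind of matching described above, on coded peak positions together with "no-peak" areas, can be illustrated with a minimal sketch. The integer coding, the scoring rule, the function names (encode_peaks, score, search), and the three-entry library are all assumptions made for illustration; they do not reproduce the ASTM card layout or the algorithm of any of the search systems named here.

```python
def encode_peaks(peaks_um):
    """Code each peak position (microns) as an integer count of 0.1 micron units."""
    return {round(p * 10) for p in peaks_um}


def score(unknown_peaks, unknown_no_peak, ref_peaks):
    """Return None if the reference has a band where the unknown shows none,
    otherwise the number of coded peak positions shared with the reference."""
    if ref_peaks & unknown_no_peak:
        return None
    return len(unknown_peaks & ref_peaks)


def search(unknown_peaks, unknown_no_peak, library):
    """Rank reference serial numbers by shared peaks, best first."""
    hits = []
    for serial, ref_peaks in library.items():
        s = score(unknown_peaks, unknown_no_peak, ref_peaks)
        if s is not None and s > 0:
            hits.append((s, serial))
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [serial for _, serial in hits]


# Hypothetical library keyed by serial number (not real ASTM entries).
library = {
    1001: encode_peaks([3.4, 5.8, 6.9, 8.1]),
    1002: encode_peaks([3.0, 3.4, 6.2, 9.5]),
    1003: encode_peaks([3.4, 5.8, 7.3]),
}

unknown = encode_peaks([3.4, 5.8, 6.9])
# Assumed "no-peak" area: the unknown shows no absorption from 2.8 to 3.1 microns.
no_peak = {28, 29, 30, 31}

print(search(unknown, no_peak, library))
```

As in the systems described, the result is only a ranked list of serial numbers; in this sketch entry 1002 is rejected because it has a band at 3.0 microns, inside the unknown's no-peak area.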



Smaller collections include those produced by the following groups. Many are available only in printed form, necessitating some form of digitization before incorporation into a data base. Prior to 1974, as listed above, some of these collections were included in the ASTM index.



The Coblentz Society (4) has published over 10,000 high quality, evaluated IR spectra, many of which follow their published Class II guidelines (5). The spectra are available from the Society on 16 mm microfilm or from Sadtler Research Laboratories (3) in loose-leaf binders of 1000 each. Smaller collections for specific chemical classes are also available from the Coblentz Society.



The American Petroleum Institute (API) initiated its Research Project 44 in 1942 to collect IR spectra of petroleum hydrocarbons and of nitrogen- and sulfur-containing compounds found in petroleum. The project is now directed by the Texas A&M University Thermodynamics Research Center (TRC) (6) as the TRC-API Project 44, with 3545 spectra currently available in 9 loose-leaf binders. TRC also compiles a TRC Data Project IR spectral collection of organic compounds (excluding hydrocarbons) with an emphasis on heavy-use industrial chemicals. This was formerly the Manufacturing Chemists' Association's subscription publication. Currently, 1358 spectra are available on data sheets in 4 loose-leaf volumes. Both the TRC-API 44 and TRC Data Project collections are updated and expanded continuously, both by the addition of new spectra and by the replacement of older spectra with new ones of higher quality.



The Aldrich Chemical Company published a second edition of "The Aldrich Library of Infrared Spectra" in 1975 with a total of over 10,000 IR spectra of compounds from the Aldrich catalog of chemicals (7). A third edition will be available in a year or so, adding approximately 2000 to 3000 additional spectra to the collection. Aldrich has created a file sequencing major IR peak maxima (two per spectrum) on magnetic tape that could be accessed by a search program. A set of microfiche created from this tape is available.



The IR Data Committee of Japan has published over 14,000 IR spectra on edge punched cards (8).



The "Documentation of Molecular Spectroscopy" (9) is a collection of over 22,000 IR spectra obtained from European labs. This printed publication was discontinued in 1973.



The "IR of Selected Chemical Compounds" (10) contains 1800 spectra obtained in the laboratories of the University of Freiburg. Updating of the collection has also been discontinued.



Sadtler Research Laboratories (3) has been publishing IR spectra for a number of years and has the largest collection of published IR spectra in printed form (or 16 mm microfilm) that is publicly available. Standard spectra of organic compounds of 98% or higher purity are published at a rate of about 2000 per year, resulting in prism and grating collections of 57,000 each. Special collections (toxic chemicals, biochemicals, steroids, etc.) and commercial collections (adhesives and sealants, lubricants, solvents, etc.) currently add about 45,000 to the size of the Sadtler collection. In addition, this year Sadtler is publishing 1200 high resolution evaluated quantitative IR spectra that comply with the Coblentz Society Class II guidelines (5).



In summary, the largest and the most widely used data base is the ASTM index, but it suffers from several shortcomings that are not readily remedied:



(A) Spectra appearing since 1972 are not included.



(B) Only coarse spectral features are retained (peak maxima at 0.1 micron intervals, no relative intensity information).



(C) The spectral quality varies, depending upon the subcollection.
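Shortcoming (B) can be made concrete with a short calculation. The sketch below uses only the standard relation between wavelength and wavenumber (wavenumber in cm^-1 = 10,000 / wavelength in microns); the function names and the choice of sample positions are illustrative assumptions, not part of any ASTM specification.

```python
def wavenumber(wavelength_um):
    """Convert a wavelength in microns to a wavenumber in cm^-1."""
    return 10000.0 / wavelength_um


def bin_width_cm1(center_um, step_um=0.1):
    """Width in cm^-1 of the wavelength interval that rounds to center_um."""
    low = center_um - step_um / 2
    high = center_um + step_um / 2
    return wavenumber(low) - wavenumber(high)


# Sample positions chosen for illustration only.
for center in (3.0, 6.0, 10.0):
    print(f"{center:4.1f} micron -> {bin_width_cm1(center):6.1f} cm^-1 wide")
```

Under these assumptions a single 0.1 micron interval spans roughly 111 cm^-1 near 3 microns but only about 10 cm^-1 near 10 microns, so the precision of a coded peak position depends strongly on where in the spectrum the band falls.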



A wealth of printed or original spectra is available from the literature and the collections listed above, but to date no concerted effort exists to digitize these data in a uniform fashion that would be easily searchable by the scientific community. Several digitization approaches are currently being tested as solutions to this problem. Block Engineering has embarked on a program to completely digitize the Sadtler collection of condensed phase IR spectra and expects to have about 50,000 spectra on magnetic tape in 18 months. A task force of JCAMP, chaired by C. Craver, is currently exploring the feasibility of preparing a computer-based index of IR spectra obtained since 1972. This project involves, however, the coding of only peak positions and intensities, rather than full spectra. In addition, Clerc and coworkers (11) have developed a procedure for converting existing IR data into computer readable form, using a PDP-8 and a DEC writing tablet. This method appears to be both accurate and economically feasible and should lead to better data for future search systems. A group directed by Hippe of the Technical University of Rzeszow in Poland (12) has also undertaken a digitization project using a special high speed, semi-automatic digitizer. Finally, at least one simplified scheme for after-the-fact digitization has appeared that could be easily employed in a laboratory lacking expensive computer tablets (13). Clearly, these measures are necessary only for recovery of data obtained prior to the advent of the on-line computers that now form a part of many spectrometers, both FT and dispersive, where spectra are automatically digitized and "peak-picking" can be programmed.
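Once a spectrum exists in digitized form, programmed peak-picking can be quite simple. The following is a minimal sketch, assuming an evenly sampled absorbance trace and a fixed threshold; the function name pick_peaks, the threshold value, and the sample data are hypothetical and do not represent the routine of any particular spectrometer or of the projects cited above.

```python
def pick_peaks(wavenumbers, absorbances, threshold=0.1):
    """Return (wavenumber, absorbance) for each local maximum at or above threshold."""
    peaks = []
    for i in range(1, len(absorbances) - 1):
        a = absorbances[i]
        # Strictly above the previous point, not below the next one,
        # so a flat-topped band is reported once.
        if a >= threshold and a > absorbances[i - 1] and a >= absorbances[i + 1]:
            peaks.append((wavenumbers[i], a))
    return peaks


# Hypothetical digitized trace: wavenumbers in cm^-1, absorbance units.
wn = [1800, 1780, 1760, 1740, 1720, 1700, 1680, 1660, 1640]
ab = [0.02, 0.05, 0.30, 0.85, 0.40, 0.08, 0.12, 0.06, 0.03]

print(pick_peaks(wn, ab))
```

A list of positions and intensities of this kind corresponds to the reduced representation considered by the JCAMP task force, whereas the Block Engineering effort aims to retain the full trace.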



Vapor Phase Spectra



In the area of vapor phase IR spectra, work is just beginning. Almost a decade ago Welti (14) published a collection of vapor phase IR spectra, but this collection was not continued and it is of minimal value. The Coblentz Society (15) has prepared a proposed set of standards for running vapor phase IR spectra and is about to publish a special IR spectral collection of gases and vapors that will contain 200 spectra of halogenated hydrocarbons and ubiquitous chemicals. Sadtler's collection currently contains about 2500 IR vapor phase spectra, with the addition of a further 2500 spectra planned for late 1979. The National Institutes of Health (NIH) has contracted with the Polytechnic Institute of New York to run a few hundred vapor phase FT-IR spectra, and the EPA Environmental Research Laboratory in Athens, Georgia, has contracted for the running of some 2000 GC/FT-IR spectra through Sadtler Research Laboratories. It is expected that the latter two collections will become the nucleus of a high quality data base of vapor phase spectra, although liquid phase spectra may also be included as long as they are of the same high quality and are properly encoded.



There are a number of private collections of IR vapor phase spectra, but no organized effort is currently evident to look into means by which these many small but valuable private collections can be collected, integrated, and made available to the scientific community.



IR Search Systems



Over the years a number of computer based search systems for IR have been developed. These systems have had considerable testing and use, and some are available commercially. The more recent systems that have been developed have been "research" oriented systems, and as such, have been limited in their exposure and use. The four primary systems, all of which are sequential, batch oriented systems, are described below. The systems are based on the ASTM data bases of 102,000 or 143,000 spectra.



The FIRST-I system, developed by Erley (16) at the Dow Chemical Company, is a search program written for a number of IBM systems: the 1130 (on which the system was developed), the 1800, and the 360/370 series. The system comes as a complete package and may be purchased from DNA Systems (17).



The Search Program for Infrared Spectra (SPIR) has been developed under the leadership of Jones et al. (18) for use within Canada. The program uses the ASTM data base, and the algorithm used is the FIRST-I search system rewritten for the local computer facility in Canada. SPIR is now administered by the Canada Institute for Scientific and Technical Information (CISTI) of the National Research Council (NRCC) in Ottawa, with the Division of Chemistry of NRCC responsible for any spectroscopic aspects. The computer costs for a SPIR search run between $2 and $6 per spectrum (excluding telecommunications charges), with a minimum charge of $25 per month, which includes a credit of $10 toward searches. The system will shortly be available across Canada to subscribers of CISTI's Canada On-Line Enquiry service network (CAN/OLE).



The IRGO system, developed by Craver, is an on-line commercial search system, available from Chemir Laboratories (19) via an international timesharing network, running on the Tymshare Sigma 7 computer. This search system, which contains some 150,000 IR spectra, is used by dialing in via the network; alternatively, a service is offered in which an expert will conduct searches for the chemist on a fee-for-service basis. The cost of the system varies with the number of searches conducted per month. A fee of $85 per month includes 3 searches, with a charge of $26 for each additional search. Alternative plans are available if less use is contemplated.

The IRIS system, available from Sadtler Research Laboratories, is a file of about 110,000 coded spectra. The data base, last coded in 1970, includes part of the ASTM file and the Sadtler standard prism collection. It runs on a Univac computer, is available from University Computing Corporation, and is networked. A quarterly data access fee of $75 to $200 (depending on usage) is charged, and searches cost from about $5 to $18 each, depending on the size of the data base selected for a search.



Each of the systems mentioned above suffers from the ASTM search system defects of having neither the actual reference spectral data nor the names of the "hits" from a search available. Even though the ASTM IR name formula index is available on magnetic tape from Sadtler (AMD 36 and supplements), it is apparently not used because of computer storage costs. Thus, one must go to a set of books to look up each spectrum individually. This is most probably one reason for the relatively low usage of the systems. A further difficulty is that some of the books of spectra are no longer available for purchase.



Other search systems have been developed in the past and generally have been limited to private use. These include:



(A) The EKIRSS system developed by Kodak (20).



(B) SIRCH-360 system developed by Erley, on which the FIRST-I system is based. This is also distributed as the Sadtler/ASTM SIRCH 3 system, which includes 102,000 ASTM spectra.



(C) The EASS system developed by Edgewood Arsenal (21).



(D) The MIRET system developed by Penn State (22).



The different characteristics of the systems are given in Table 1, prepared by R. Katz (23) for an EPA study of IR search systems.



Combined Search Systems



More recently there has been a renewed interest in using IR in conjunction with other spectral data for the purpose of structure elucidation. These systems have been developed mainly in Europe and will now be briefly described.



Hadzi and coworkers (24) in Yugoslavia have developed a search system for IR, mass spectrometry, and CNMR, called COSMOSS. The COSMOSS system runs on a CDC CYBER 172 computer and is written in FORTRAN IV. The IR portion of the system uses 92,000 spectra, the majority of which are derived from the ASTM file. Using a test set of 100 hand-coded spectra of chemicals known to be in the IR file, the system retrieved 81 of the compounds, 77 of which were correctly listed as the most probable match. This IR performance of 77/100 compares with 91/100 for mass spectrometry and 95/100 for CNMR. The difference was attributed to the deficiencies inherent in describing the broader IR bands by ASTM coding procedures. In addition, a second IR search system developed in Yugoslavia for minicomputers uses fully digitized spectra of limited collections such as polymers. After a successful search, the actual spectrum can be displayed in analog form by this system (25).



Using data converted with the PDP-8 and DEC writing tablet (cf. above), Clerc has developed the OCETH system for spectral interpretation, which uses IR, CNMR, UV, and mass spectrometry (11). The system runs on a CDC 6500, is written in FORTRAN, and uses about 2000 IR spectra.



Conclusion



Both the number and the quality of publicly available computer-readable IR data bases are limited. Updating of the ASTM index (about 145,000 spectra) was discontinued in 1974, and the data base contains limited spectral information. The Sadtler vapor phase and quantitative spectra amount to only a few thousand. Other smaller private collections cannot currently be pooled because of disparate formatting procedures. Literally thousands of spectra have been published that have not been included in any data base.



Most search systems are specific to the ASTM file or to private collections, which limits their widespread use. Years of experience in developing these systems in the US, Canada, and Europe are available, however, as a basis for future cooperative efforts.



Many members of the international scientific community including users of IR spectra data, compilers of computer databases, and generators of IR data in digital form are currently investigating possibilities for future cooperation under the leadership of the Joint Committee on Atomic and Molecular Physical Data.



Table 1. Characteristics of Known IR-Search Systems





Acknowledgment



The authors would like to express their thanks to Dr. R. Katz for the use of Table 1 and to Dr. E.D. Becker and the Joint Committee on Atomic and Molecular Physical Data for providing a preliminary copy of their report on IR data bases.





References



1. A.L. Smith. Coblentz Society Newsletter, No. 78, May 15, 1979.



2. Joint Committee on Atomic and Molecular Physical Data. E.D. Becker, Chairman. Bldg. 2, Room 122, National Institutes of Health, Bethesda, Maryland 20205.



3. Sadtler Research Laboratories. 3314 Spring Garden Street, Philadelphia, Pennsylvania 19104.



4. Coblentz Society. P.O. Box 9952, Kirkwood, Missouri 63122.



5. The Coblentz Society Specifications for Evaluation of Research Quality Analytical Infrared Spectra (Class II). Anal. Chem. 47: 945A (1975).



6. Thermodynamics Research Center. Texas A&M University, College Station, Texas 77843.



7. C.J. Pouchert. The Aldrich Library of Infrared Spectra. Aldrich Chemical Company, 940 West St. Paul Avenue, Milwaukee, Wisconsin 53231.



8. Nankodo Co. Ltd. Publishers and Booksellers, 42-6, Hongo 3-chome, Bunkyo, Tokyo 113, Japan; Mailing address: P.O. Box 5272 Tokyo International, Tokyo 100-31, Japan.



9. Butterworth Scientific Publications, London WC2, England.



10. Mecke and Langenbucher. Heyden & Son Ltd., Spectrum House, Alderton Crescent, London NW4, England.



11. J.T. Clerc, R. Knutti, H. Koenitzer, and J. Zupan. Z. Anal. Chem. 238: 177 (1977).



12. Z. Hippe. Private communication.



13. M.F. Delaney and P.C. Uden. Anal. Chem. 50: 2156 (1978).





14. D. Welti. Infrared Vapor Spectra. Heyden and Son Ltd., London (1970).



15. Specifications for Infrared Reference Spectra of Materials in the Vapor Phase above Ambient Temperature. Prepared by the GCIR Subcommittee of the Coblentz Society Evaluations. P.R. Griffiths, Chairman, Ohio University, Athens, Ohio.



16. D.S. Erley. Anal. Chem. 40: 894 (1968).



17. DNA Systems, Inc., P.O. Box 1424, Saginaw, Michigan 48605.



18. E.M. Kirby, R.N. Jones, and D.G. Cameron. Codata Bull. 21: 18 (1976).



19. Chemir Laboratories. 761 West Kirkham, Glendale, Missouri 63122.



20. D.H. Anderson and G.L. Covert. Anal. Chem. 39: 1288 (1967).



21. E.C. Penski, D.A. Padowski, and J.B. Bouck. Anal. Chem. 46: 955 (1974).



22. R.W. Sebesta and G.G. Johnson, Jr. Anal. Chem. 44: 260 (1972).



23. R. Katz. Feasibility Study and Evaluation of Infrared (IR) Search Systems. Prepared for EPA Contract No. 68-01-2733, March 1975.



24. J. Zupan, M. Penca, D. Hadzi, and J. Marsel. Anal. Chem. 49: 2141 (1977).



25. J. Zupan. Private communication.