Numerical Data Acquisition, Dissemination, and Retrieval Systems

Stephen R. Heller

USDA, ARS, Beltsville, MD 20705-2350 USA

Abstract

Scientific data acquisition, dissemination, and retrieval methods are continuing to increase. As both information specialists and working scientists realize the need for and value of good scientific data, more activity in this area is undertaken. This paper will describe some problems in this field, and will propose the creation of a knowledge base for spectroscopy as a means of helping to solve the problem of chemical structure elucidation.

INTRODUCTION

The topic of the 8th International Conference on Computers in Chemical Research and Education (ICCCRE) Key-note and Focused Discussion session on numerical data involved three areas: data acquisition, data dissemination, and data retrieval. Numerical data suffer from shortcomings in all three, but mostly in data acquisition. By acquisition I mean obtaining the data, either from instruments or calculations, evaluating the data, creating a complete description of the data, and putting the resulting complete record into a readily accessible computer readable form. Furthermore, I also mean the intelligent and logical organization of these data in databases that are more than just good and useful databases, namely knowledge bases. Knowledge bases, as used here for spectroscopy, are correlations of spectral data with the associated chemical structure(s).

DATA ACQUISITION

In the data acquisition process there are four areas for which one needs information before the data record is complete. These areas are:

1. Data Parameters

2. Actual Data

3. Experimental Conditions

4. Sample Description (e.g., purity)

The data parameters would be a list of initial and final values (such as in a spectrum), the minimum and maximum values (such as 0 to 100%, or if appropriate added decimal places), the units for the data, the number of data points, the chemical name (preferably using standard nomenclature), the Chemical Abstracts Service Registry number and/or other identification numbers, and the data format. The actual data are simply the numbers (with the proper significant digits) from the instrument or experiment. The experimental conditions, while an obvious requirement, are often left out or reported in an incomplete manner. The sample description would include the source, purity, method(s) of purity and identification verification, and so forth for the chemical. Ideally all of these items would be included in a database. Sadly it must be said that this is not the case and a truly complete record is rarely found. This is partly because of a lack of knowledge and understanding of database activity, by which I mean the generally narrow view of many who create databases. Many databases have been put together by a researcher in the field who has little or no interest in data per se, but rather is interested only in the specific data of his or her own field. There is also the lack of resources, which is a more difficult problem to solve.
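The four areas listed above can be sketched as a record structure with a simple completeness check. This is only an illustration under stated assumptions: the class and field names (DataParameters, SampleDescription, DataRecord, missing_items) are hypothetical, and the sample values are invented, not taken from any existing database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataParameters:
    """Area 1: parameters that describe the data (field names are illustrative)."""
    first_x: float
    last_x: float
    min_y: float
    max_y: float
    x_units: str
    y_units: str
    n_points: int
    chemical_name: str
    cas_registry_number: Optional[str] = None
    data_format: Optional[str] = None


@dataclass
class SampleDescription:
    """Area 4: source, purity, and how purity and identity were verified."""
    source: Optional[str] = None
    purity_percent: Optional[float] = None
    verification_methods: List[str] = field(default_factory=list)


@dataclass
class DataRecord:
    parameters: DataParameters               # area 1
    data: List[float]                        # area 2
    experimental_conditions: Dict[str, str]  # area 3
    sample: SampleDescription                # area 4

    def missing_items(self) -> List[str]:
        """List the items that keep this record from being complete."""
        missing = []
        if self.parameters.cas_registry_number is None:
            missing.append("CAS Registry number")
        if self.parameters.data_format is None:
            missing.append("data format")
        if len(self.data) != self.parameters.n_points:
            missing.append("actual data (point count mismatch)")
        if not self.experimental_conditions:
            missing.append("experimental conditions")
        if self.sample.source is None:
            missing.append("sample source")
        if self.sample.purity_percent is None:
            missing.append("sample purity")
        if not self.sample.verification_methods:
            missing.append("purity/identity verification method")
        return missing


# Invented example: a record that has data but little supporting description.
record = DataRecord(
    parameters=DataParameters(4000.0, 400.0, 0.0, 100.0, "1/cm",
                              "% transmittance", 3, "hypothetical compound"),
    data=[98.1, 45.0, 97.3],
    experimental_conditions={},
    sample=SampleDescription(source="in-house synthesis"),
)
print(record.missing_items())
```

A check of this kind makes the incompleteness of a record visible at the time the data are collected, rather than when someone later tries to use them.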

In almost every computer system, be it a dedicated data system on an instrument, an in-house computer system, or a large online system, there are two main problems. The biggest problem is the lack of sufficient data parameters on a large number of chemicals. The quality of the data is a close second, but it is the opinion of the author that the main problem is the lack of data. This lack of sufficient data has been the case for many years with little reason to believe there will be any change. One possible solution to this would be to make the maximum use of what data there are by taking these data and information and creating a knowledge base of spectral correlations. This matter will be addressed in more detail later on in this paper.

Before continuing on about data quality and/or data completeness, it is useful to say a few words, by using an example, about why it is really important. At a recent database and online meeting on CD-ROM (Compact Disk - Read Only Memory) technology there were considerable discussions on what is available on CD-ROM. There is a great deal of interest in putting all types of databases on CD-ROMs for ease of dissemination, ability to handle large volumes of data, and so forth. The cost of a CD-ROM consists of the pre-mastering cost, the mastering cost (to prepare data for the master disk which will then be used to make all the copies), and the copying cost. To make a point, the answer was given by discussing the steps in reverse order. Once the master disk is prepared, the actual cost to copy a disk is a few US dollars (perhaps $3 - $5). The mastering cost, which requires a machine which costs perhaps $500,000 to $1,000,000, is about $1,000 - $2,000. The pre-mastering costs were estimated at $0 to $100,000. The $0 cost was for the example yet to be achieved. That is, if you have a "clean" and complete database, then there is no problem, and hence lower costs. However, many CD-ROM projects start out by first having to create the database. Data creation, or data acquisition, is the expensive, time consuming, and labor intensive part of a database project.

Data Quality

The situation with data quality is improving, albeit slowly. As time goes on the community is realizing that good data acquisition starts from the time you begin to plan the database collection activity. Work on the mass spectral database Quality Index (QI) (1) has continued under the leadership of Bill Budde of the US EPA. Budde has analyzed the meaning of the elements which make up the QI, and has made some modifications based on his experience (2,3). This is as it should be in science. When new information is obtained or old data are reanalyzed, changes often occur. This is because the QI is not an absolute measure of quality, but rather a relative measure, and most importantly, a statement that the quality is defined. In the area of NMR, Bremser (4), after discussions at the previous ICCCRE in 1985 (5), has published a first version of a quality indicator for 13-C NMR spectra. This quality indicator is composed of five components, but does not yet include any component concerned with sample quality, preparation, and source, with instrument conditions, or with other parameters which would affect the quality of a spectrum.

In the area of infrared (IR) spectroscopy, a first step is now underway under the direction of Wilkins and Griffiths (6) to create a quality index for FT-IR (Fourier Transform IR) spectra. (Non FT spectra will be accepted, but given an arbitrary lower rating, assuming all other parameter evaluations and factors are the same.) These researchers have decided not to accept any data which were not collected directly in computer readable form, as such spectra are much more subject to clerical data errors. They have proposed a three-digit quality index. Each digit lies within the range of 0 (low) to 9 (high). The first digit represents a measure of the sample authenticity and purity. The second is a measure of the sample preparation. The last of the three digits refers to the instrumental operating conditions. The authors believe this will allow the user of this proposed database to discriminate easily between, for example, badly measured spectra of pure samples, and well measured spectra of relatively impure samples. The draft questionnaire used to determine the values of the three digits is seven pages long. By the time these scientists and their advisory committee make the first draft available to the public, it will probably be longer and more detailed. Of course a more detailed list of information to be filled out may very well decrease the cooperation and response the authors receive. This desire or need to have detailed information on each spectrum is another reason that a knowledge base which contains correlations based on many observations, but which does not contain the details necessary for a library spectrum, has considerable attraction and value. Still, this effort should be encouraging to the IR community in particular, and the scientific community in general.
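The three-digit index described above can be illustrated with a short sketch. The names QualityIndex and parse_quality_index are hypothetical, and the example codes are invented; how each digit would actually be derived from the questionnaire is not specified here and is not modeled.

```python
from typing import NamedTuple


class QualityIndex(NamedTuple):
    """Three digits, each from 0 (low) to 9 (high)."""
    sample: int       # sample authenticity and purity
    preparation: int  # sample preparation
    instrument: int   # instrumental operating conditions

    def code(self) -> str:
        return f"{self.sample}{self.preparation}{self.instrument}"


def parse_quality_index(code: str) -> QualityIndex:
    """Split a three-digit code such as '972' into its three components."""
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected exactly three digits, got {code!r}")
    return QualityIndex(*(int(ch) for ch in code))


# Invented examples: a pure sample measured badly, and a relatively
# impure sample measured well.
pure_badly_measured = parse_quality_index("972")
impure_well_measured = parse_quality_index("379")

print(pure_badly_measured.sample > impure_well_measured.sample)          # True
print(pure_badly_measured.instrument < impure_well_measured.instrument)  # True
```

Keeping the three digits separate, rather than summing them into one score, is what allows a user to tell these two cases apart.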

I believe that as each of these data evaluation projects is started and maintained, it will lead both to improvements in defining the quality of scientific data, and to additional pressure on other scientific groups to develop quality index parameters for their data.

In addition to having better databases I also believe these data quality evaluations will allow for the development of knowledge bases. This is because as evaluators begin to examine collections of data to determine how the data are best evaluated, they will begin to think in terms of correlations from the existing data collected. These collections will, in time, lead to the creation of what are called knowledge bases. As stated above, in simple terms a knowledge base can be defined as a collection of correlations of data and the partial or full chemical structure(s) associated with the data. Appendix 1 describes a draft proposal I have made to create a knowledge base of spectral data which will consist of spectral data and the structure, sub-structure, and/or structures which are associated with the data. This knowledge base will be a valuable resource to all spectroscopists. The knowledge base will also identify "data gaps", which are areas for which there are little or no data on which to make any correlation, or areas in which one is not able to make an acceptable correlation. (Having only one example of a data value makes it a bit hard to have a clearly valid correlation, and even more difficult to have an acceptable standard deviation.) Lastly, the knowledge base will help in the data evaluation by identifying possible data outliers, which would make the scientist examine the data and experimental conditions more carefully and closely.
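A minimal sketch of these three uses (correlation, data gaps, and outliers) follows. It is not a design for the proposed knowledge base: the substructure labels and numerical values are invented, and both the threshold MIN_OBSERVATIONS and the two-standard-deviation outlier rule are assumptions chosen only for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List

MIN_OBSERVATIONS = 2  # assumed: fewer observations than this is a "data gap"


def summarize(values: List[float]) -> Dict[str, object]:
    """Mean and sample standard deviation of the observations for one entry."""
    n = len(values)
    return {
        "n": n,
        "mean": mean(values) if n >= 1 else None,
        # A standard deviation needs at least two observations.
        "stdev": stdev(values) if n >= 2 else None,
    }


def is_data_gap(values: List[float]) -> bool:
    return len(values) < MIN_OBSERVATIONS


def find_outliers(values: List[float], k: float = 2.0) -> List[float]:
    """Values more than k sample standard deviations from the mean."""
    if len(values) < 3:
        return []
    m, s = mean(values), stdev(values)
    if s == 0:
        return []
    return [v for v in values if abs(v - m) > k * s]


# Invented observations (e.g. band positions) keyed by a substructure label.
knowledge_base = {
    "substructure A": [1715.0, 1712.0, 1718.0, 1714.0,
                       1716.0, 1713.0, 1717.0, 1650.0],
    "substructure B": [2250.0],  # a single observation: no standard deviation
    "substructure C": [],        # no data at all
}

for label, observations in knowledge_base.items():
    print(label, summarize(observations),
          "gap" if is_data_gap(observations) else "",
          find_outliers(observations))
```

With these invented numbers the value 1650.0 is flagged for substructure A, while substructures B and C are reported as data gaps. A flagged value is not necessarily wrong; it is a prompt to re-examine the data and the experimental conditions.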

DATA DISSEMINATION

The subject area of data dissemination is easier to deal with, since the matter of obtaining and preparing the data is over at this point. There are three main media with which to disseminate data. They are:

1. Computer tapes

2. Floppy Disks

3. CD-ROM

Computer tapes offer an excellent way to disseminate large databases. Today many labs can read standard 800/1600/6250 bpi 9 track tapes. As long as the provider of the data documents the tape format and data elements it should be an easy matter to read a tape of data. It is usually up to the user to create a searchable version of the data. However, in some cases it may be useful or desirable to provide the user with search and retrieval software. Generally most search and retrieval software is written by and for the programmer in the laboratory and the particular environment of that laboratory. Thus most software either does not perform all the functions desired, or performs them in a unique manner which is tailored to a particular environment.

With the enormous growth in the number of personal computers, it is becoming more likely that databases (and, in the case of large databases, subsets of databases) will be made available on floppy disks. The data on these floppy disks can also easily be put onto hard disks. With storage capabilities of 360 K to 1.2 MB for 5 1/4 inch floppy disks for IBM PCs and their equivalents, and 720 K - 1.44 MB for the newer 3 1/2 inch floppy disks, quite a lot of data can be made readily available to anyone. IUPAC has recently established a new committee called the Committee on Chemical Databases (7). This committee is planning to disseminate IUPAC databases on tape and floppy disks. Since most of the IUPAC databases are small, the floppy disk is an ideal medium for dissemination. It is also ideal because most members of IUPAC have, or have access to, a computer with a floppy disk drive. The National Bureau of Standards, Office of Standard Reference Data (NBS - OSRD) is about to release a search and display software system which will include their mass spectral database of about 45,000 spectra (8). All this will easily fit on a 20 MB hard disk. The search times are quite fast (on the order of a few seconds) and at a price of $750, this seems likely to increase the use of such data and the general interest in numeric spectral data.

CD-ROM (Compact Disk - Read Only Memory) is a dissemination medium of great potential. While there are few CD-ROM disk drives in computer systems today, their numbers are expected to increase considerably in the next few years. The cost of mastering a disk is high, but if 100 - 2000 copies can be sold, there are great economies of scale. At present the publishing company of John Wiley & Sons is experimenting with a mass spectrometry database on a CD-ROM, and the number of sales in the first few months is low. But Wiley should be given credit for exploring the market and seeing what the community wants. The Wiley system comes with a program to search the database and display the resulting hits.
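The economies of scale can be made concrete with a small calculation using the cost ranges quoted earlier in this paper (pre-mastering $0 to $100,000, mastering $1,000 - $2,000, copying $3 - $5 per disk). The function name cost_per_copy and the mid-range values chosen below ($1,500 and $4) are my own assumptions for illustration, not figures from any vendor.

```python
def cost_per_copy(premastering: float, mastering: float,
                  copying: float, copies: int) -> float:
    """Fixed costs spread over the number of copies, plus the per-disk copying cost."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return (premastering + mastering) / copies + copying


MASTERING = 1500.0  # assumed mid-range of the $1,000 - $2,000 estimate
COPYING = 4.0       # assumed mid-range of the $3 - $5 estimate

for premastering in (0.0, 100000.0):
    for copies in (100, 2000):
        unit = cost_per_copy(premastering, MASTERING, COPYING, copies)
        print(f"pre-mastering ${premastering:>9,.0f}, "
              f"{copies:>4} copies: ${unit:,.2f} per copy")
```

Under these assumptions a clean database costs $19.00 per copy at 100 copies and $4.75 at 2000 copies, while a database that needs $100,000 of pre-mastering work costs $1,019.00 and $54.75 per copy respectively. The pre-mastering (data acquisition) cost, not the disk, dominates.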

Before ending the discussion on data dissemination it is worth noting an interesting and valuable project of the US Joint Committee on Atomic and Molecular Physical Data (JCAMP) (9). The JCAMP-DX (DX stands for data exchange) is a project for having a common data exchange format for IR spectra from different instruments and manufacturers. This would allow IR data to be readily exchanged via tape, over telephone lines, or by other means. The most promising aspect of this project is that the exchange format appears to be broad enough to be used for other forms of data exchange.
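To give a flavor of such an exchange format, the sketch below reads the header of a JCAMP-DX style file, in which each item is a labeled record of the form ##LABEL= value. It is a simplified illustration only: the function name parse_labeled_records is hypothetical, the sample header is invented, and the sketch ignores comments, multi-line records, and the compressed tabular data forms that the full specification defines.

```python
from typing import Dict


def parse_labeled_records(text: str) -> Dict[str, str]:
    """Collect single-line '##LABEL= value' records into a dictionary."""
    records: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("##") or "=" not in line:
            continue  # skip data lines and anything that is not a labeled record
        label, value = line[2:].split("=", 1)
        records[label.strip().upper()] = value.strip()
    return records


# Invented header for illustration; values do not describe a real spectrum.
SAMPLE = """\
##TITLE= hypothetical compound
##JCAMP-DX= 4.24
##DATA TYPE= INFRARED SPECTRUM
##XUNITS= 1/CM
##YUNITS= TRANSMITTANCE
##FIRSTX= 4000
##LASTX= 400
##NPOINTS= 3601
##END=
"""

header = parse_labeled_records(SAMPLE)
print(header["XUNITS"], header["YUNITS"], header["NPOINTS"])
```

Because every item carries its own label, a reader can pick out the parameters it understands and skip the rest, which is one reason a format of this kind can be extended to other types of data.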

RETRIEVAL SYSTEMS

There are few online retrieval systems for scientific numerical data (10). Most of the online systems, such as ORBIT and DIALOG, are almost exclusively bibliographic. The NIH/EPA Chemical Information System (CIS) (11) is a major source of online numeric databases, as is the Numerica system from Technical Data Services (12). The Beilstein Institute is developing its own numeric data retrieval system for use with the Beilstein Handbook database, which will become available online in 1988 (13). There are other systems, but these are all much smaller than the CIS or Numerica systems at present, and generally contain only one type of data, such as crystallographic data, powder diffraction data, or chemical Material Safety Data Sheets (MSDS). In the future it is expected that the Chemical Abstracts Service (CAS) Science and Technology Network (STN) will expand the capabilities of their Messenger software so it will be able to search scientific data (such as the Beilstein factual database) as well as the bibliographic text data it is now able to search.

CONCLUSION

The data acquisition process continues to improve, and data quality and data evaluation are getting more time and attention from scientists and database organizers. The development of quality index values for spectral databases is extending to more areas of spectral data and is becoming more routinely accepted by scientists. It would be desirable for the process to move even more quickly, and hopefully more discussion and more use of these types of data will speed up this process.

The need to proceed to the next step of creating knowledge bases from spectral databases is an idea whose time has come. It is expected that such a project will be started in the near future and will result in both more valuable information and knowledge for spectroscopists and a better understanding of the data contained in a spectrum.

Data dissemination is being handled in an acceptable manner, with CD-ROM technology having a promising future. The searching and retrieval of numerical data are generally so far ahead of the databases that this is an area which requires less attention. No matter how good or efficient the computer software may be, it cannot overcome poor, missing, or incomplete data.

Appendix 1

A Proposal for a Knowledge Base for Spectral Interpretation Systems

INTRODUCTION

There have been many efforts over the past three decades to use computers for spectral interpretation and identification (14). These efforts have centered around two areas, namely library searching and expert systems.

The first area is library search identification using spectral databases. Library search systems are now commonly available as part of mass spectrometer, infrared spectrometer, and other spectrometer data systems. They are also available in online timesharing systems. These systems, while varying in quality and use, all suffer from the lack of a sufficiently large, diverse (in terms of different types of chemical structures), and high quality collection of data, which limits their use.
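The principle of a library search can be sketched in a few lines. The example below ranks library spectra by cosine similarity to an unknown spectrum; this is one simple, generic similarity measure that I have assumed for illustration, not the matching algorithm of any particular instrument or online system. The function names and the two library entries are invented.

```python
import math
from typing import Dict, List, Tuple

Spectrum = Dict[int, float]  # peak position (e.g. m/z) -> intensity


def cosine_similarity(a: Spectrum, b: Spectrum) -> float:
    """1.0 for identical peak patterns, 0.0 when no peaks are shared."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_library(query: Spectrum, library: Dict[str, Spectrum],
                   top: int = 3) -> List[Tuple[float, str]]:
    """Return the best-matching library entries, highest score first."""
    scored = [(cosine_similarity(query, spectrum), name)
              for name, spectrum in library.items()]
    scored.sort(reverse=True)
    return scored[:top]


# Invented library and unknown spectrum.
library = {
    "compound A": {15: 20.0, 43: 100.0, 58: 30.0},
    "compound B": {51: 40.0, 77: 100.0, 105: 60.0},
}
unknown = {15: 15.0, 43: 95.0, 58: 35.0}

for score, name in search_library(unknown, library):
    print(f"{name}: {score:.3f}")
```

The sketch also shows the limitation discussed above: however good the matching function, the search can only return a compound whose spectrum is already in the library.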

The second area is Expert Systems (ES), which are part of the field of artificial intelligence (AI). The DENDRAL (15) program was the first of these systems. As Zupan has pointed out, its actual success has been somewhat less than claimed (16). Other systems, such as CASE (17) and CHEMICS (18), also have had limited success and certainly are not used routinely in labs. The major problem with these systems has been that no one laboratory has the resources to create a large enough knowledge base to interest a broad user community in testing, evaluating, and using the system. While AI and ES are in vogue, and considerable interest, technical and commercial, has been shown in ES in the past few years, most of this has been form, not content. Existing ES are a primitive technology primarily because they lack large and high quality knowledge bases.

In his Presidential Address at the 1985 American Association for Artificial Intelligence meeting (19) Woody Bledsoe stated: "I believe that it is time to build large, very large knowledge bases. Such a knowledge base should contain common sense knowledge and encyclopedic and expert knowledge and be structured to handle learning and performance requirements ... It is believed that such a large-structured knowledge base would not only allow the sharing of knowledge by numerous systems, but, if structured correctly, could provide much more robustness and functionality than is possible from a number of distinct smaller knowledge bases."

PROPOSAL

It now appears reasonable to conclude that the most critical need for an expert system for spectral structure elucidation is a large and diverse knowledge base. Such a knowledge base, encompassing the areas of mass spectrometry, IR, NIR, NMR (13-C and H), UV, VIS, and possibly others, such as Raman, would be of immense and critical value to the scientific community. There are no current projects which this proposal would duplicate. There are a number of spectral database projects, in the public and private sector (20), but these are "just" databases. Instrument companies have been experimenting with the use of ES for instrument tune-up and control (21), and one has even incorporated an existing ES into their data system (22). There is an excellent collection of tables of structure-spectral correlations which has been published in book form (23), but is not yet computerized. Hopefully this information could be used as part of the knowledge base to be developed.

It is proposed that an international collaborative project be established to create and maintain a knowledge base of spectral-structure relations. Such a knowledge base will contain the accumulated wisdom, knowledge, and experiences of experts in all fields of spectroscopy. In effect we will be creating a Ph.D. in spectroscopy. Like getting a Ph.D., this will take time - a number of years. However, besides creating a valuable scientific resource, this effort will identify knowledge gaps; that is, areas for which there are no known or agreed upon structure-spectral data correlations. These gaps can become areas for new scientific research and investigations in the particular area(s) of spectroscopy identified.

It is suggested that the responsibility for the knowledge base maintenance and distribution reside with the ARS Model and Database Coordination Laboratory (MDCL) in Beltsville, Maryland, USA. MDCL has available a VAX/MicroVAX system which is networked via Telenet, BITnet, and ARPAnet, allowing for easy access to send in rules for the knowledge base, to discuss the accuracy and validity of rules via electronic mail and electronic conferencing available on the system, and to allow for dissemination of the knowledge base, via the network or on magnetic tape. It would be expected that all collaborating members would be provided with a copy of the current knowledge base as "payment" for their contributions. Details of how others would obtain a copy will need to be worked out. One likely possibility would be to have IUPAC act as the organization for the dissemination of this knowledge base. With the recent establishment of an IUPAC Committee on Chemical Databases (7), there is both a potential source of seed funding for the project and a well known and respected scientific organization to lead the project. CODATA and JCAMP, two other organizations involved with scientific data and spectroscopy, cou