Computer readable analytical chemical data - comments on a critical need



Stephen R. Heller
US - EPA
MIDSD
Washington, DC 20460 USA

and

Rudolph Potenzone Jr.
US - FDA
HFN-870
Bethesda, MD 20205 USA

Do academics, journal editors and research funding agencies place too little emphasis on the compilation of high quality data collections? Stephen Heller and Rudolph Potenzone think so, and contrast this situation with that found in industry, where large databanks abound.

We know that much of the research in analytical chemistry, as in other branches of chemistry, is done at universities. Analytical instrument manufacturers throughout the world make extensive use of academics as consultants in the development of new, better, and more sensitive instruments. Government agencies in the USA and elsewhere provide funding to academics to conduct research in both established and new areas of analytical chemistry.

Contrast this with the general situation in the area of analytical chemical data. Such data are derived from experiments being performed in academic institutions throughout the world. Suggest to an academic that they have a responsibility to make their analytical data readily available in a complete and high quality form, and one is likely to be treated less kindly than a witch in Salem in the seventeenth century. Suggest to a US Government agency that grantees provide such information to the funding agency, a professional society or some other such organization, and one might be made to feel that a new sin has been created. (One notable exception to this is in the area of X-ray crystallography data; see Ref. 1 and Bergerhoff, G., Hundt, R., Sievers, R. and Brown, I. D., unpublished observations.)

Journal editors (and reviewers), with their interest in publishing the best scientific results in the minimum amount of space, seem to thrive on publishing longer discussions and fewer experimental details. Commercial publishers, looking to maximize their return on investment, seem to unquestioningly follow the editors on this matter. Science as a whole does not benefit from this approach.

Visit an industrial organization and one finds a most striking contrast. Large databanks (both computerized and manual) abound. Collections of spectral, chromatographic, toxicological, environmental, biological and other data for large numbers of chemicals have been, and are being, collected, stored and indexed. The latest data collections in most fields are publicly available, along with modern computer systems to handle, manipulate, and model the data. Although the university laboratory usually has the same instruments and performs the same experiments, it tends not to have the many data collections available.

Why does academia ignore an area that industry stresses? Does industry know something yet to be learned at the university? Perhaps so. Why is concern (and effort) in (high quality) data collection viewed so negatively by academics, journals and research funding agencies? One simple and obvious answer is that data collection is not "new." That is, it is not "research" providing "new" results, hence it is not acceptable to academia in their quest for knowledge, nor to the funding agencies whose object is to fund innovative research. Of course, it is also unacceptable to the journal editors who want to maintain the high standards of scientific literature and publish only the best of new research. There are those who believe that perhaps 50% of the results in the literature are not reproducible, because of the lack of sufficient experimental detail.

The "information explosion" has been going on for at least two decades. We are now unable to browse the literature and abstracting journals because both the time and cost of such browsing are prohibitive. We are clearly overloaded in our ability to collect, evaluate, store, retrieve and digest all the scientific results coming forth. Yet, except in industry, this overload is not being adequately addressed. Computers can act as powerful tools to help handle this overload, especially with the advent of the inexpensive microcomputer or personal computer. But without anything to put into these computers, there is nothing these tools can do for us.

How then can we solve this problem? Perhaps one should first state what the alternatives might be if we don't. If academia has less information and knowledge, and is no longer in the forefront of science and new knowledge, how then will future generations be properly educated? Will the rich (industry and the biggest and richest universities) publish more than the poorer academic institutions, and will these smaller and less affluent institutions eventually disappear from academia? A recent article in this journal (2) has addressed the more technical and scientific aspects of this problem in the area of spectroscopic databases, but did not address the issue of how to obtain these data.

This is not the first time this problem has been aired, nor will it be the last. An attempt to draw attention to this matter was made in a guest editorial, published in Analytica Chimica Acta a year ago, entitled "Where have all the data gone?" (3). Considering that no response whatsoever was received to this editorial, except a 'thank you' from the editor, is this discussion likely to help? There are many scientists (mostly outside academia) who believe in this need (4). Perhaps, if this issue is raised frequently enough, a majority of analytical chemists and other scientists will one day become converts and will find a mechanism to make large amounts of (high quality) scientific data readily available, for everyone's benefit. Information is knowledge, knowledge is power, and power, in science, is perhaps best handled when widely disseminated and available. Modern society is becoming increasingly dependent on the fruits of science and technology. Experimental work in analytical chemistry is expensive, and unwitting duplication is not beneficial. Scientific data are needed for solving many key problems in the environment, energy production and general industrial productivity. If analytical chemists do not contribute by filling the void in information and data, there will be no solution.

How then can this critical need for the preservation of high quality analytical chemical data be met? While the suggestions proposed here are probably not well developed enough to solve the problem, it is to be hoped that they may act as a starting point for future discussion and development. It is easy to propose solutions, and while we do offer a number of 'straw man' comments, we hope these will be taken only as a starting point for discussion and debate within the analytical chemistry community.

A number of national programs have been initiated over the years in this area, but with a lack of the necessary funding to proceed at the level and rate needed, little progress seems to have been made. In the USA, the Standard Reference Data Act of 1968 (Ref. 5) led to the formation of the Office of Standard Reference Data (OSRD), within the National Bureau of Standards. This program has never been funded at the level authorized by Congress, nor at the level needed to cover the vast area of scientific data within the bounds of the mandate of the Act. Rather, OSRD has used its limited resources to compile and disseminate as many quality data collections as possible, maximizing the co-operation of academia and industry (6).

In the UK, the Cambridge Crystallographic Data Centre has, with the co-operation of the Royal Society of Chemistry, become a repository for all X-ray data on organic chemicals (1). A similar data center for inorganic chemicals has been established as a joint German-Canadian project (Bergerhoff, G. et al., unpublished observations).

Recently, in the USA, the American Institute of Chemical Engineers has organized a Design Institute for Physical Property Data (DIPPR), in order to gather, determine and evaluate data on physical properties of compounds and mixtures processed by the chemical industry (7). The project goal is to collect this information on about 1,000 compounds by 1985. Already data on almost 200 chemicals have been made available to the members of DIPPR. Those organizations that are not members will not be able to obtain the information until a year after the members of DIPPR first obtain it. While this is clearly understandable, since DIPPR members pay for this valuable effort, it prompts some questions. How will academia, and those small companies not able to afford membership fees in DIPPR, be able to obtain the information on a timely or reasonably priced basis? Is there a need for a government and/or professional society (or societies) organization to handle such an effort?

In the USA, the American Chemical Society (ACS) has decided to put its resources (via Chemical Abstracts) into becoming an online vendor of its own bibliographic information, rather than extract and evaluate information from the journals abstracted by Chemical Abstracts Service (CAS). Is this in the best interests of the ACS members or organizations which support the ACS? If the ACS does not do this work, who will? Some publishers are now getting into this business, with the appropriate goal of trying to make this activity a profitable one. John Wiley produces and disseminates a mass spectral database, partly using data submitted by authors of its journals. Elsevier is heavily involved in the medical area with the large and well known database of bibliographic medical literature, Excerpta Medica. Will this type of activity on the part of publishers lead to only one source of scientific data? Can such data actually be copyrighted? What will duplicate data extraction activities do to the ownership rights of such data collections? What is the incentive for an owner of data, who desires to make a profit from such an effort, to attempt an exhaustive compilation of the required data, and to evaluate the data adequately? If data are found to be of poor quality, should they be excluded from data collections, included (with the quality problem noted), or should the experimental data be re-obtained under better conditions? Perhaps Government organizations and professional societies or associations can (and do) undertake such activities, as, for example, in the industry-sponsored toxicology program (CIIT), but can publishers, or other profit-making companies, do the same? Perhaps international co-operation is an answer (4).

As one can see, it is easier to define the problem than the solution. However, understanding the problem is, without doubt, the first step towards solving it. Hopefully, this presentation will further understanding of the problem, stimulate discussion and hence lead one step closer to the solution.

References

1. Allen, F. H., Bellard, S., Brice, M. D., Cartwright, B. A., Doubleday, A., Higgs, H., Hummelink, B. G. T., Hummelink-Peters, B. G., Kennard, O., Motherwell, W. D., Rogers, J. R., and Watson, D. G. (1979) Acta Crystallogr., B35, 2331-2339

2. Clerc, J. T., and Szekely, G. (1983) Trends Anal. Chem., 2, 50-52

3. Heller, S. R. (1982) Anal. Chim. Acta, 136, 1

4. Horvath, A. L. (1983) Chem. Eng. News, April 4, page 51 (Letter to the Editor)

5. Standard Reference Data Act of 1968, Public Law 90-396

6. Lide, D. R. Jr. (1981) Science, 212, 1343-1349

7. Chem. Eng. News (1983) Jan. 3, pages 34-36