Stephen R. Heller
and
Lewis H. Gevantman
The scientific community is currently
undergoing phenomenal changes in the
way it produces, compiles, and disseminates scientific
information and data. Much of it can now be managed by
computer. Similarly, bibliographic literature searching has
become routine using a variety of on-line systems
(see box).
A newer activity that is starting to make its presence felt in
the scientific community is that of providing the scientist with
access to numerical data through a variety of automated
mechanisms. Several numerical data bases have emerged and are now
being used. However, most of these data bases remain unique; some have search
and retrieval capabilities, but for the most part they remain
free-standing bodies of numerical data without connection to
one another.
A decade ago the idea of combining related numerical data bases
for the environmental science community (1) was activated first
at the National Institutes of Health and subsequently by the EPA.
Coordination with other government agencies led to a system of 22
data bases and software analysis programs entitled Chemical
Information System (CIS). This system, which is under private
operation, is available to the environmental scientist. However,
some of the newer computer capabilities and improved data
management designs are not available in the system.
A numerical data base is defined as a collection or compilation
of numerical data that describes either singular or multiple
physical parameters of a specified chemical substance. Generally,
the number of parameters and substances is limited. The data
base may have simply a listing of numbers or it may have search
and retrieval software for extracting the desired data value.
Some come with calculational capability for extrapolating or
interpolating values based on theoretical or empirical grounds.
Most of these data bases are free-standing and each may require
the use of a unique approach for extraction of the information
contained in the data base. This makes for the investment of
considerable time and effort on the part of the user to learn and
manipulate each data base. In addition, the data contained in
such data bases are culled from the scientific literature and
other sources with minimal attempt to evaluate the quality of the
data. Also, some data bases are created with little or no regard
for adding to or updating their capability. This tends to render
the data less useful with the passage of time.
On-line dissemination
A better approach to on-line numerical data dissemination for environmental scientists has been initiated by Technical Database Services, Inc. (TDS, see box), in its Numerica system. In contrast to the individual free-standing data bases available to users, the Numerica system is a cluster of experimental data bases, each of which has been scrutinized and accepted by the data base developer for quality and timeliness. Although each data base comes from a different source, inconsistencies between data bases (primarily in the area of nomenclature) are resolved by the use of Chemical Abstracts Service (CAS) registry numbers. Furthermore, each data base is closely related to the others, so the data can be checked for consistency across all data bases. Programs are available to round out and compare calculated versus experimental values, with further opportunity to extend the data into areas where no data exist. The search and retrieval strategies for accessing the data bases are also rendered in a consistent manner, so there is no need to learn how to use each data base separately. Consequently, the buildup of data bases clustered about a central theme such as environmental science presents a highly useful systems concept.
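The role of the CAS registry number as a common key can be illustrated with a minimal sketch. The record layout, source labels, and function name below are assumptions made for illustration only; they do not describe how Numerica itself stores or links its data.

```python
from collections import defaultdict

# Hypothetical records from two sources that name the same substance
# differently. Field names and source labels are illustrative only.
records = [
    {"source": "database_a", "cas_rn": "92-52-4", "name": "1,1'-biphenyl"},
    {"source": "database_b", "cas_rn": "92-52-4", "name": "biphenyl"},
    {"source": "database_b", "cas_rn": "71-43-2", "name": "benzene"},
]


def group_by_cas(rows):
    """Group records by CAS registry number, ignoring the name used."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cas_rn"]].append(row)
    return dict(grouped)


by_cas = group_by_cas(records)
# Both spellings of the name land under the same registry number.
names = sorted(row["name"] for row in by_cas["92-52-4"])
```

Because the grouping never compares names, differences in nomenclature between sources do not prevent their records from being brought together.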
Highlights of how the Numerica system searches for specific
chemical properties data illustrate the system's utility in
meeting the needs of the environmental scientist. For example,
the chemical 1,1'-biphenyl (molecular formula C12H10) has been
assigned CAS registry number 92-52-4. Access to the Numerica
on-line system is achieved in the usual manner through a dial-up
telephone network, in this case Telenet.
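A CAS registry number carries its own check digit, which makes a mistyped or garbled number easy to detect before a search is run. The following is a minimal sketch of that check; the function name is an assumption, and nothing here is taken from the Numerica software. The final digit must equal the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10.

```python
def cas_check_digit_ok(cas_rn: str) -> bool:
    """Return True if cas_rn is well formed and its check digit matches."""
    parts = cas_rn.split("-")
    if len(parts) != 3:
        return False
    if not all(p.isascii() and p.isdigit() for p in parts):
        return False
    first, second, check = parts
    # Format: two to seven digits, then two digits, then one check digit.
    if not (2 <= len(first) <= 7 and len(second) == 2 and len(check) == 1):
        return False
    digits = first + second
    # Weight the rightmost digit by 1, the next by 2, and so on.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(check)


# 1,1'-biphenyl: 2*1 + 5*2 + 2*3 + 9*4 = 54, and 54 mod 10 = 4.
biphenyl_ok = cas_check_digit_ok("92-52-4")
```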
What data base to use
Before beginning a search of Numerica data bases, one must check
to see which, if any, data base contains data on the chemical of
interest. This can be done by using SYNDEX, which contains an
index of CAS registry numbers, chemical names, synonyms, and TDS
data base tags. This allows the user to save both time and money
by avoiding systems that do not contain the needed data. The
results of the search show that numerical data on 1,1'-biphenyl
can be found in the Thermodynamic Research Center Database, the
Carcinogenic Information Database for Environmental Substances
(CIDES) (see box), the Environmental Fate Database (EFDB) (2),
and the Log P and Related Parameters Database (3). Illustrative
excerpts from three of these data bases are shown in Table 1. In
performing a search for data and information on 1,1'-biphenyl,
the first type of data to be searched is environmental fate data.
Yet another data base, CHEMEST, can be used to calculate many of
the parameters described in Table 1. CHEMEST shows property
relationships and fills in for missing data.
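The kind of front-end lookup described for SYNDEX can be sketched as a small index that maps a registry number, name, or synonym to the data bases holding data on that substance. This is an illustration under stated assumptions, not the SYNDEX implementation: the entry layout, the function name, and the short tags TRC, CIDES, EFDB, and LOGP are hypothetical labels chosen for the four data bases named above.

```python
# Hypothetical index entry; the tags are illustrative labels, not the
# actual TDS data base tags.
INDEX = [
    {
        "cas_rn": "92-52-4",
        "names": ["1,1'-biphenyl", "biphenyl"],
        "databases": ["TRC", "CIDES", "EFDB", "LOGP"],
    },
]


def find_databases(query, index=INDEX):
    """Return the data base tags for a CAS number, name, or synonym."""
    key = query.strip().lower()
    for entry in index:
        known_names = {name.lower() for name in entry["names"]}
        if key == entry["cas_rn"] or key in known_names:
            return list(entry["databases"])
    # An empty result tells the user not to spend time in any data base.
    return []


hits = find_databases("Biphenyl")
misses = find_databases("benzene")
```

Checking the index first is what saves time and money: a query that returns no tags never needs to be run against the individual data bases.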
It is hoped that the searches and displays shown have conveyed
the value of having high-quality data that are easily searched in
an on-line system. Numerica is a simple system to use, and having
the SYNDEX data base as a front end allows one to perform a quick
and efficient search of available data before even going into a
specific data base. Lastly, the ability or potential to compare
experimental and calculated data (e.g., in the TRC Datafile) is a
valuable resource.
Looking ahead
The future direction for Numerica is twofold. First, there is the
need to bring other useful and important data bases into the present
cluster. By integrating the new data with the
present files, an ever-increasing numerical data capability is
available to the user with a minimum investment of time and
effort. Second, the cluster of environmental data permits the
formation of other clusters of numerical data that may relate to or
be completely independent of the environmental data. The option
to select the theme of the next cluster is obviously predicated
upon the availability of data bases and the ability of such data
bases to serve a need expressed by the user community.
For example, a new cluster aimed at chemical engineering,
chemical manufacture, and data that embrace chemical and
pharmaceutical concerns is a natural extension of the Numerica
system. Obvious relationships to the Physical Properties Data
System and other data bases would then promote the use of both
data clusters because of the need to satisfy data requirements of
a mutual nature. Similarly, a biologically oriented cluster would
again have implications for relating the contents of the new set
of data bases to the existing cluster. Given the flexibility and
advantages described, clustered data bases are believed to be a
clearly superior mechanism for producing on-line systems in the
future.
References
(1) Heller, S. R.; Milne, G. W. A. Environ. Sci. Technol. 1979,
13, 798-803.
(2) Howard, P. H.; et al. J. Chem. Inf. Comput. Sci. 1982, 22, 38-44.
(3) Leo, A. J. J. Chem. Soc., Perkin Trans. II 1983, 825-38.
Stephen R. Heller is a research leader in the model and data base
coordination laboratory of the Department of Agriculture,
Beltsville, Md.
Lewis H. Gevantman is a guest researcher at the National Bureau
of Standards in Gaithersburg, Md. He retired from federal
service after 17 years as program manager in the NBS Office of
Standard Reference Data.