Stephen R. Heller
USDA, ARS, BARC-W
Bldg. 007, Room 56
Beltsville, MD 20705-2350 USA
(301-344-1709)
Telex: 258 594 MDCL.UR
Telemail: SRHELLER
BITNET: SRHELLER@UMDARS
Abstract
Questions and issues concerning property prediction are discussed. Reliability, the evaluation of predicted values, the handling of differing values predicted by different methods, and the way the data should be presented to the user community are all critical to a project on the scale of the Beilstein database of chemicals. Lastly, a proposal is presented for how to begin addressing large-scale prediction of the properties of chemicals.
Introduction
In the "best of possible worlds" (1) a chemist who discovers a new
compound would analyze its structure unambiguously and obtain
accurate data on at least 50 of its most important chemical and
physical properties. We are here in Bolzano because we do not live
in the best of possible worlds. The Beilstein Institute exists
because, in spite of Candide's optimism, this is not the best of
all possible worlds, scientifically or otherwise. The best reason
I can give for the lack of published property data about a compound
is that most chemical compounds (in fact some 75%) are only
reported once in the scientific literature (2). Some 15% are
reported twice, which leaves only 10% of the entire known universe
of chemicals for which there are more than two literature
citations. Thus, it is quite clear that since one probably
couldn't even readily obtain an authentic sample of most chemicals
in the Beilstein database, the issue of the very costly and time-consuming efforts to experimentally obtain the data desired is
really irrelevant. Hence, the very clear reason for this workshop
is to find reliable, consistent, and easy to use methods for
chemical and physical property prediction.
Property prediction is not new. One of the first and probably the best known chemical property predictions was made some 117 years ago. It is quite fitting that the home of Professor Friedrich Konrad Beilstein, St. Petersburg, was also the place which gave rise to property prediction, the subject of this Beilstein Workshop. In 1871, Dmitri Ivanovitch Mendeleev published his prediction (Figure 1) of the existence and properties of Eka-aluminum, Eka-boron, and Eka-silicon (3). Within 15 years, these elements and their properties were discovered and the predictions of Mendeleev were shown to be rather accurate. No doubt those were simpler times, and the task the Beilstein Institute wishes to undertake is of much greater complexity than the elements in the periodic table.
Background
The Beilstein Institute factual database now being put into
computer-readable form contains some 400 parameters, some of which
are shown in Figure 2. As we all know, there are three stages or
steps in the creation of a complete database of numerical data.
These are:
1. Collect Experimental Data
2. Evaluate Data
3. Fill in Data Gaps
For some 100 years, the Beilstein Handbook activities have been
involved with the first two of these stages, that is the collection
and extraction from the scientific literature and the evaluation of
the data. Now, as one looks at the Beilstein Handbook of over 350
volumes, the question has arisen: what about the "data gaps"? Can
something be done to fill in the blanks? To put some perspective
on the magnitude of the "data gaps" in one particular area,
solubility, it is worthwhile to mention what Horvath (4) said in
his book on halogenated hydrocarbons:
"Despite the great demand for solubility data by scientists and engineers, experimental values published in the open literature are very limited. Regarding the availability of solubility data for halogenated hydrocarbons in water, Beilstein (4th Supplementary Series, Volume 1/Part 1 (1958), Volume V/Part 1 (1963), and Volume V/Part 2 (1964)) cites 1369 compounds up to six carbon atoms, of which only 61 have information as to their solubility, mostly for a single temperature only."
In some cases methods have been developed for the prediction of a
property. In other cases there has been no reported research or
activity for predicting a particular property. Prediction should
not be confused with prioritization. As Bill Milne at the NIH
Cancer Institute will discuss in a later talk, it is often very
useful to know the relative values of a given property. While this
is definitely property estimation, it is not the sort of estimated
values which would fill the "data gaps" in the Beilstein database.
Also, in this matter of relative values of a given property, it
could be harmful if non-experts see relative numbers and mistakenly
take these values to be real and absolute numbers.
While predictive methods or procedures can be developed without data, they can only be tested if one has data. Such data must also have several attributes. First, they must be accurate and precise enough to test a given hypothesis on how to predict a numeric data value. Second, there needs to be a reasonable number of data points so that there is some "weight of evidence" behind the prediction. While two points will mathematically give a straight line, in chemistry (as well as other disciplines) more data are required before the scientific community accepts that the preponderance of evidence supports a conclusion. Third, the data should cover a broad range of chemical classes to have maximum predictive value.
In preparing this presentation I searched the literature from 1967 to early 1988, using the Online Chemical Abstracts database, for property predictions. I also talked with some colleagues about this subject. If this literature search and these discussions were all one knew about organic chemistry, it would seem that the field was concerned only with hydrocarbons and simple mono-functional group compounds with fewer than 10-15 non-hydrogen atoms. Most predictive methods are useful as teaching examples, rather than being able to fill in the "data gaps" in the Beilstein Handbook or elsewhere.
My research is concerned with the creation of a database of chemical, physical, and other properties of pesticides for use in models which will predict possible contamination of our nation's groundwater. With few exceptions (methyl bromide and 1,2-dibromoethane (EDB) being the only two I have found so far), most pesticides contain over 10 non-hydrogen atoms and 2-3 elements in addition to carbon and hydrogen. Thus, for this research to proceed, new or expanded predictive capabilities are needed. Before this can occur, at least in the pesticide chemistry field, one needs better data for the existing compounds. One should not try to extrapolate from a vacuum into the real world.
How does one choose which method for property prediction to use?
Are the known methods valid for the entire range of organic
chemistry? Of course not. Have the authors explicitly stated what
the limitations of the methods are? What is the reliability of a
given method? What are the error ranges for the predicted data?
Scientists are notorious for reporting calculated numbers to beyond
the range of significant figures. What is necessary to be sure
predicted data are properly presented? How was the method
developed? If the method is based on calculation of another piece
of data for input, what is the reliability and error range
associated with the input data?
One of the first and most important properties being collected for
the Agricultural Research Service (ARS) database is the aqueous
solubility of a pesticide. Highly soluble materials are rapidly
distributed in the soil and can be easily transported to the water
table. There are a number of methods for estimating solubility.
Five basic methods are described by Lyman and his colleagues (5,
Chapter 2, Table 2-1). However, most give an estimated value at
only one temperature (25 °C), and "few have actually been presented
(and tested) as predictive tools" (5, page 2-1). Furthermore,
issues such as "relative merits, applicability, and accuracy" of
these methods had not been reviewed prior to the work of Lyman and
his colleagues (5, page 2-2). For example, the PC-GEMS program
(6) first calculates an octanol/water partition coefficient (LogP)
from a two dimensional structure input. It then calculates a
melting point from the two dimensional structure and the calculated
LogP value. From there the water solubility is calculated. As
seen in Figure 3, the predicted water solubility is given to three
significant figures, without any justification that the prediction
is that accurate. Another estimation program, CHEMEST (7), which
is based on work published in the book by Lyman and his colleagues
(5), provides for calculations of 11 different properties using 36
different methods. This system, CHEMEST, fares much better with
respect to providing information about the limits and accuracy of
the methods and the calculation or estimation errors. Figure 4
shows the same calculation of solubility using the CHEMEST program.
Both programs use the same procedures, but CHEMEST provides the
user with information on the error associated with the method.
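One way to read the CHEMEST error line in Figure 4 is as a multiplicative factor: a total error of "X 1.6" would place the true value between the estimate divided by 1.6 and the estimate multiplied by 1.6. The Python sketch below works under that assumption. The interpretation, the function names, and the choice of one significant figure are illustrative and are not taken from the CHEMEST documentation. It shows how the range could be reported alongside the estimate, and how the estimate could be rounded so that it does not imply more precision than the stated error supports.

```python
import math


def error_range(estimate, factor):
    """Return (low, high), assuming `factor` is a multiplicative error."""
    if estimate <= 0 or factor < 1:
        raise ValueError("estimate must be positive and factor >= 1")
    return estimate / factor, estimate * factor


def round_sig(value, digits):
    """Round a nonzero value to the given number of significant figures."""
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


# Values copied from Figure 4 (propanol, water solubility in mg/L).
solubility = 8.59e4
total_error = 1.6

low, high = error_range(solubility, total_error)
print(f"estimate: {round_sig(solubility, 1):.0e} mg/L")
print(f"range:    {round_sig(low, 1):.0e} to {round_sig(high, 1):.0e} mg/L")
```

With an error factor of this size the second and third digits of 8.59E+04 carry no information, which is the concern raised above about reporting calculated numbers beyond their significant figures.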
Recently I was told (8) of a chemical company which decided to re-run some partition coefficient data for several acid anilide
pesticides. The chemicals were from their own company, as well as
from other manufacturers. They designed and ran the experiments
very carefully, with the proper quality assurance and quality
control. When they compared their results to the predicted values
using the CLOGP program (9), there were sufficient differences to
warrant some concern. As a result, they discussed the matter with
the world's authority in the field, Professor Corwin Hansch. After
seeing how the experimental data were collected and what values the
CLOGP predictive program generated, Hansch and his coworkers
revised the CLOGP program to take the high quality experimental
data into account.
How many and what type of compounds were used to test the validity
of the method? How accurate were the data used to develop and/or
test the predictive technique? How does one prove that the
predictive method used is accurate? In a talk at this workshop
Peter Jurs from Penn State University (USA) will describe some of
his research activities in this field, including a recently
reported study (10) on predicting olefin boiling points from
molecular structure. Jurs wisely chose these compounds because
there were a reasonably large number of compounds (123) available
and the data were of high quality. The method he has developed for
the class of compounds studied appears to have solved the problems
encountered with earlier prediction techniques (7,11) which were
not able to handle many isomeric compounds. In contrast, the data
in the Beilstein Handbook, while evaluated, are collected in a
random fashion in terms of a class or series of compounds. Also,
with rare exceptions the data come from many sources, published
over many years, and using many different experimental conditions.
Thus, it may not be possible to obtain a large number of similar
compounds with enough identical properties from the Beilstein
database to assure that a predictive method is properly tested and
evaluated. Unquestionably this is a handicap which must be
overcome. Without sufficient and good experimental data,
predictive methods must be viewed very carefully.
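The basic arithmetic of testing a predictive method against experiment can be sketched in a few lines of Python. The example below uses the numeric rows of Figure 1 (Mendeleev's predictions and the later determinations) purely as sample input; the labels and the choice of percent deviation as the measure are assumptions made for illustration. Ten values for three elements are far too few to validate anything, so this shows only the bookkeeping, not a validation study.

```python
# (property, predicted, determined) values copied from Figure 1.
FIGURE_1 = [
    ("Ga atomic weight", 68.0, 69.9),
    ("Ga specific weight", 6.0, 5.96),
    ("Ga atomic volume", 11.5, 11.7),
    ("Sc atomic weight", 44.0, 43.79),
    ("Sc2O3 specific weight", 3.5, 3.864),
    ("Ge atomic weight", 72.0, 72.3),
    ("Ge specific weight", 5.5, 5.469),
    ("Ge atomic volume", 13.0, 13.2),
    ("GeO2 specific weight", 4.7, 4.703),
    ("GeCl4 density", 1.9, 1.887),
]


def percent_error(predicted, determined):
    """Signed deviation of the prediction, in percent of the measured value."""
    return 100.0 * (predicted - determined) / determined


errors = [(name, percent_error(p, d)) for name, p, d in FIGURE_1]
for name, err in errors:
    print(f"{name:24s} {err:+6.1f} %")

# Summary statistics: the single worst case and the average magnitude.
worst = max(errors, key=lambda item: abs(item[1]))
mean_abs = sum(abs(err) for _, err in errors) / len(errors)
print(f"largest deviation: {worst[0]} ({worst[1]:+.1f} %)")
print(f"mean absolute deviation: {mean_abs:.1f} %")
```

Reporting the worst case next to the average matters here, because a method that is good on average can still fail badly for one class of compounds.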
Assuming satisfactory answers to these questions, let us now
proceed to the question of data quality or reliability. How should
the predicted results be evaluated? Since all the experimental
data which go into the Beilstein Handbook are evaluated, it is
reasonable to assume that methods must be developed to evaluate the
predicted data. What does one do in the case where two (or more)
methods are believed to be scientifically valid, yet yield
different answers? We are beginning to develop a series of expert
systems for data evaluation. The process will be based on our
SELEX expert system (12) which provides objective and consistent
evaluations of published data on the selenium content in foods.
Our first data property expert systems will be in the areas of
solubility and vapor pressure evaluations.
Once agreement is reached on a number for inclusion in the
Beilstein database, how will it be noted or tagged in the database?
Will it be clear to the user that the information or numeric value
is not an experimental value, but rather a predicted or
calculated value? Certainly a clearly marked reference citation
should suffice, but how can one be sure that an entry transferred
from the Beilstein database to a report (or a value from the ARS
Pesticide Property database used as input for some model or other
purpose) is properly referenced and properly used? Should
experimental data be in a separate section of the Beilstein
database to help assure the user notices the difference? Should
there be a notation in the record saying, for example, "No
experimental data, please see predicted value given below"?
Should interpolated data be noted as such, as compared with
extrapolated data? How will the evaluation criteria take such
differences into account? What will happen when an experimental
value for a particular parameter is found? Will the experimental
value automatically replace the predicted value? What if the
values are far apart? In some cases errors can be a few percent or
less (for interpolation), but can be orders of magnitude (for
extrapolation). What might this imply about the method used for
the prediction? Will there be a notation in the record that the
newer, experimental value is a replacement value? Should the
original predicted value be kept in the database?
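One possible answer to several of these questions is to store the origin of each value with the value itself and to keep superseded predictions rather than overwrite them. The Python sketch below is a hypothetical record layout, not a description of the Beilstein database: the class names, field names, and the three origin categories are all assumptions. The sample numbers are the atomic weight of eka-silicon from Figure 1, and treating Mendeleev's value as an interpolation is likewise an illustrative choice.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    EXPERIMENTAL = "experimental"
    PREDICTED_INTERPOLATED = "predicted (interpolated)"
    PREDICTED_EXTRAPOLATED = "predicted (extrapolated)"


@dataclass
class PropertyValue:
    value: float
    unit: str
    origin: Origin
    source: str                           # citation or name of the method
    error_factor: Optional[float] = None  # multiplicative error, if known


@dataclass
class PropertyEntry:
    name: str
    current: PropertyValue
    superseded: List[PropertyValue] = field(default_factory=list)

    def record_experimental(self, new_value: PropertyValue) -> None:
        """Replace the current value with a measurement, keeping the old one."""
        if new_value.origin is not Origin.EXPERIMENTAL:
            raise ValueError("replacement must be an experimental value")
        self.superseded.append(self.current)
        self.current = new_value

    def label(self) -> str:
        """Text that should travel with the number when it is copied."""
        v = self.current
        quantity = f"{v.value:g} {v.unit}".strip()
        return f"{self.name}: {quantity} [{v.origin.value}; {v.source}]"


# Sample values from Figure 1 (atomic weight of eka-silicon / germanium).
entry = PropertyEntry(
    "atomic weight of eka-silicon",
    PropertyValue(72.0, "", Origin.PREDICTED_INTERPOLATED, "Mendeleev 1871"),
)
print(entry.label())
entry.record_experimental(
    PropertyValue(72.3, "", Origin.EXPERIMENTAL, "determination in Figure 1")
)
print(entry.label())
print(len(entry.superseded), "superseded value(s) retained")
```

Keeping the old prediction next to the measurement makes a later comparison of the two possible, which is what one would need in order to judge the method that produced the prediction.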
When an experimental value is found to be considerably different
from the predicted value, what should be done about the predicted
values for other, similar compounds in the database (or the other
compounds in the database which have data values predicted by this
method)? If the reliability of a method comes into question later,
how easy will it be to change all the records in the database found
in a number of online systems which use this method?
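If every predicted value carries the name of the method that produced it, finding the affected records becomes a simple lookup. The Python sketch below assumes such a tag exists; the record layout, the identifiers, and the method names are invented placeholders, and only the property mnemonics (SLB, VP) come from Figure 2.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PredictedValue:
    compound_id: str       # hypothetical record identifier
    property_name: str     # mnemonic as in Figure 2, e.g. "SLB"
    method: str            # name of the prediction method used
    under_review: bool = False


def build_method_index(values):
    """Map each method name to the predicted values it produced."""
    index = defaultdict(list)
    for item in values:
        index[item.method].append(item)
    return index


def flag_method(index, method):
    """Mark every value produced by `method` as under review; return the count."""
    affected = index.get(method, [])
    for item in affected:
        item.under_review = True
    return len(affected)


# Placeholder records; the identifiers and method names are invented.
values = [
    PredictedValue("cmpd-1", "SLB", "method-A"),
    PredictedValue("cmpd-2", "SLB", "method-A"),
    PredictedValue("cmpd-2", "VP", "method-B"),
]
index = build_method_index(values)
print(flag_method(index, "method-A"), "values flagged")
```

This addresses only the master copy of the database. Copies already loaded on online systems would still have to be updated by each vendor.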
One should also ask if the method is automated. If not, can it be automated? For a method to have any possible practical application for the Beilstein Institute, and be used with such a large database as the Beilstein Structure Registry Connection Table database, computerization is essential. Is the two-dimensional structure sufficient for input into the prediction method?
Responsibility
Who is responsible for the predicted data? When an error is found,
should it or must it be quickly corrected? Certainly a computer
program can regenerate a large set of predicted values in a very
short time. Can this be done as a practical matter and will this
be done even if it is costly? If it is done, what guarantee is
there that the online vendors of the Beilstein database will
quickly replace the older or incorrect data with the corrected or
new data? Dealing with scientific data implies a greater
responsibility than is normally taken with bibliographic
abstracts.
How will the scientific community accept a database with many
entries of predicted values? How will Government agencies, in the
US, Europe, and elsewhere accept such data? What effect will
such data have on patents and patent rights? Will
predictions be considered, under any conditions, as "prior art"?
Proposal
Now that the less positive aspects of property prediction have been
raised, I would like to propose some possibilities for future work.
The research falls into two distinct areas. The first is creation
of collections of high quality databases for a series of classes of
multi-functional group compounds. This is essentially what we are
doing with the ARS Pesticide Database, since accurate values for
parameters such as solubility and vapor pressure do not exist for
most pesticides. Results from solubility experiments run under
conditions of different temperature, pH, and ionic strength will
give us the necessary data input for the wide range of agricultural
conditions which exist. We then hope to use these results as the
foundation for developing accurate predictive methods. In other
areas, such as bio-medicine and pharmaceutical chemistry, a
parameter like the octanol-water partition coefficient (LogP) may
be considered to be of high priority and importance. LogP also has
been proposed to be of potential value in the prediction of aqueous
solubility (4,9).
I would hope that the Beilstein Institute, with the support of the
German Government and others, would fund several such projects as
prototypes to see how useful some high quality data for a number of
parameters will be in creating as broad a predictive strategy as
possible.
Conclusion
This presentation discusses several difficult issues associated
with the wide-scale use of predicted property data for a large and
chemically diverse database. While the overall state-of-the-art is
in its infancy, and is quite limited in its current application,
this workshop has taken the first and bold step in looking into the
question which Clemens Jochum asked in his October 1986 letter of
invitation to all the workshop attendees: "Is it possible to fill
the data gaps in the millions of Small Information Compounds in the
Beilstein database?" As we hear the many excellent research
activities described in lectures by experts in their fields over
the next three days, I hope that we will all remember to ask some
of the questions and address some of the issues mentioned in this
presentation, so that the goal of the Beilstein Institute can be
reached.
Acknowledgements
I would like to thank my colleagues at ARS, D. Bigwood, S. Rawlins,
D. Wauchope, and C. Helling for their valuable suggestions. I
would also like to thank D. Lide and L. Gevantman (NBS) and G. W.
A. Milne (NIH) for their insightful comments and thoughts on
numeric data, data evaluation, and data quality. Lastly, I would
like to thank my son Matt for his contribution of suggesting the
Mendeleev prediction of new elements while we were studying for one
of his high-school chemistry tests.
REFERENCES
1. Dr. Pangloss in Chapter 1 of Candide, Voltaire (1759).
2. Y. Wolman, "Chemical Information - A Practical Guide to
Utilization", 2nd ed., J. Wiley & Sons, New York (1988).
3. D. Mendeleev, Ann., Suppl. VIII, 133-229 (1871).
4. A. L. Horvath, "Halogenated Hydrocarbons", M. Dekker, New York
(1982).
5. W. J. Lyman, W. F. Reehl, and D. Rosenblatt, "Handbook of
Chemical Property Estimation Methods", McGraw-Hill, New York (1982).
This book is, at present, out of print.
6. PC-GEMS (Personal Computer version of the Graphical Exposure
Modeling Program) is available from Ms. Cathy Turner, US
Environmental Protection Agency, TS-798, Washington, DC 20460 USA.
The software will be provided free of charge so long as one sends
a sufficient number of formatted 360K or 1.2 MB 5 1/4 inch floppy
disks. The 1986 manual (Publication # SGC-TR-13-88-003) is also
available at no charge.
7. The IBM PC version of CHEMEST is available for $585 from TDS
(Technical Database Services) Inc., 10 Columbus Circle, New York,
NY 10019 (Telex: 6714962).
8. D. Gustafson, Monsanto Chemical Co., St. Louis, MO 63198,
private communication.
9. MedChem Project, Chemistry Department, Pomona College,
Claremont, CA 91711.
10. P. J. Hansen and P. C. Jurs, Anal. Chem., 59, 2322-2327 (1987).
11. R. C. Reid, J. M. Prausnitz, and T. K. Sherwood, "The Properties of
Gases and Liquids", 3rd ed., McGraw-Hill, New York (1977).
12. D. W. Bigwood, S. R. Heller, W. R. Wolf, A. Schubert, and J. M.
Holden, Anal. Chim. Acta, 200, 411-419 (1987).
Figure 1: Mendeleev 1871 Property Prediction of Three Elements
Property                          Prediction        Determination
--------                          ----------        -------------
Element                           Eka*-Aluminum     Gallium (discovered in 1875)
Atomic Weight                     68                69.9
Specific Weight                   6.0               5.96
Atomic Volume                     11.5              11.7
Element                           Eka-Boron         Scandium (discovered in 1879)
Atomic Weight                     44                43.79
Oxide                             Eb2O3             Sc2O3
Specific Weight (Oxide)           3.5               3.864
Sulphate                          Eb2(SO4)3         Sc2(SO4)3
Element                           Eka-Silicon       Germanium (discovered in 1886)
Atomic Weight                     72                72.3
Specific Weight                   5.5               5.469
Atomic Volume                     13                13.2
Oxide                             EsO2              GeO2
Specific Weight (Oxide)           4.7               4.703
Chloride                          EsCl4             GeCl4
Boiling Point (Chloride)          < 100 °C          86 °C
Density (Chloride)                1.9               1.887
Ethyl Compound                    EsAe4             Ge(C2H5)4
Boiling Point (Ethyl Compound)    160 °C            160 °C
* Eka is the Sanskrit prefix for the number one
Figure 2: Examples of Data Elements from the
Beilstein Factual Database
Mnemonic Name
ATC Atom Count
BF Biological Function
BP Boiling Point
BRN Beilstein Registry Number
CCOL Crystal Color
CDEN Crystal Density
CN Chemical Name
CRP Critical Pressure
CRT Critical Temperature
CRV Critical Volume
DEN Density
DM Dipole Moment
ECOL Ecological Data
ED Entry Date
ELC Element Count
ELS Element Symbol
ENTR Entropy
FW Formula (Molecular) Weight
HFOR Enthalpy of Formation
HFUS Enthalpy of Fusion
HSUB Enthalpy of Sublimation
IP Ionization Potential
IRS Infrared (IR) Spectrum
LW Lawson (Classification Scheme) Number
MF Molecular Formula
MI Moment of Inertia
MP Melting Point
MS Mass Spectrum
NMRS NMR Spectrum
OA Optical Anisotropy
ORD Optical Rotatory Dispersion
PHWP Polarographic Half-Wave Potential
PRE Preparation
QM Quadrupole Moment
RAS Raman Spectrum
REA Reaction
RN CAS Registry Number
SFOR Entropy of Formation
SLB Solubility
SO Beilstein Handbook Source Citation
ST Surface Tension
SY Synonym
TOX Toxicity
TP Triple Point
UP Update Date
USE Use
VP Vapor Pressure
Figure 3: Sample Property Estimation From PC-GEMS Program
21:40:01 Sunday 05/01/88
PHYSICO-CHEMICAL PROPERTIES
---------------------------
Smiles Notation          = CCCO
Chemical Name            = Propanol
Molecular Formula        = C3H8O                 Calc. from Smiles
Molecular Weight         = 60.10                 Calc. from Smiles
Physical State *         = Liquid                User Entered
LogKow                   = 2.9 -01               User Entered
Water Solubility         = 8.59E+04 mg/L         Equation 13N
Melting Point            = -8.5E+01 (C)          Grain and Lyman
Vapor Pressure           = 48.33 mm at 25.00(C)  Antoine
Boiling Point            = 82.33 (C)             Meissner
Henry's Law Constant     = 4.89E-05 atm m3/mol   Method 1
Bio Concentration Factor = 9.78E-01              Kow (Method 1)
Adsorption Coefficient   = 1.00                  Kow, Eqn. 4-10
* Estimated MP or BP does not change entered physical state.
Press any key to continue
(The output shown is exactly as it appeared on the computer monitor.)
Figure 4: Sample Property Estimation From CHEMEST Program
*********************************************************
* *
* CHEMEST ............ CHEMICAL PROPERTY ESTIMATION *
* *
* FILE: ITALY.TST DATE: 1-May-88 TIME: 21:44:28 *
* *
*********************************************************
CHEMICAL NAME/IDENTIFICATION ... Propanol
============================
WATER SOLUBILITY ESTIMATION:
---------------------------
SOLUBILITY : 8.59E+04 MG/L
ESTIMATION ERROR:
----------------
METHOD ERROR : X 1.6
PROPAGATED ERROR : X 1.0
TOTAL ERROR : X 1.6
METHOD IDENTIFICATION:
---------------------
METHOD USED : 1
EQUATION USED : 13 in Reference 15
KEY INPUT:
-------
ACID GROUP IN CMPD.? : NO
OCTANOL-WATER PRT. CF. : 0.290 L
PHYSICAL STATE AT 25 C : L
(The output shown is exactly as it appeared on the computer monitor.)