Questions and Issues about the Process of Estimating Properties of Chemicals

Stephen R. Heller
USDA, ARS, BARC-W
Bldg. 007, Room 56
Beltsville, MD 20705-2350 USA
(301-344-1709)
Telex: 258 594 MDCL.UR
Telemail: SRHELLER
BITNET: SRHELLER@UMDARS

Abstract

Questions and issues about property prediction are addressed and discussed. Issues such as reliability, evaluation of predicted values and how to handle different/multiple values predicted from different methodology, and how the data should be presented to the user community are critical to a project of the scale of the Beilstein database of chemicals. Lastly a proposal is presented as to how to initially address large-scale prediction of properties of chemicals.

Introduction

In the "best of possible worlds" (1) a chemist who discovers a new compound would analyze its structure unambiguously and obtain accurate data on at least 50 of its most important chemical and physical properties. We are here in Bolzano because we do not live in the best of possible worlds. The Beilstein Institute exists because, in spite of Candide's optimism, this is not the best of all possible worlds, scientifically or otherwise. The best reason I can give for the lack of published property data about a compound is that most chemical compounds (in fact some 75%) are only reported once in the scientific literature (2). Some 15% are reported twice, which leaves only 10% of the entire known universe of chemicals for which there are more than two literature citations. Thus, it is quite clear that since one probably couldn't even readily obtain an authentic sample of most chemicals in the Beilstein database, the issue of the very costly and time-consuming efforts to experimentally obtain the data desired is really irrelevant. Hence, the very clear reason for this workshop is to find reliable, consistent, and easy to use methods for chemical and physical property prediction.

Property prediction is not new. One of the first and probably the best known chemical property predictions was made some 117 years ago. It is quite fitting that the home of Professor Friedrich Konrad Beilstein, St. Petersburg, was also the place which gave rise to property prediction, the subject of this Beilstein Workshop. In 1871, Dmitri Ivanovitch Mendeleev published his prediction (Figure 1) of the existence and properties of Eka-aluminum, Eka-boron, and Eka-silicon (3). Within 15 years, these elements and their properties were discovered and the predictions of Mendeleev were shown to be rather accurate. No doubt those were simpler times and the task the Beilstein Institute wishes to undertake is of much greater complexity than the elements in the periodic table.

Background

The Beilstein Institute factual database now being put into computer-readable form contains some 400 parameters, some of which are shown in Figure 2. As we all know, there are three stages or steps in the creation of a complete database of numerical data. These are:

1. Collect Experimental Data

2. Evaluate Data

3. Fill in Data Gaps

For some 100 years, the Beilstein Handbook activities have been involved with the first two of these stages, that is, the collection and extraction of data from the scientific literature and the evaluation of those data. Now, as one looks at the Beilstein Handbook of over 350 volumes, the question has arisen: what about the "data gaps"? Can something be done to fill in the blanks? To put some perspective on the magnitude of the "data gaps" in one particular area, solubility, it is worthwhile to mention what Horvath (4) said in his book on halogenated hydrocarbons:

"Despite the great demand for solubility data by scientists and engineers, experimental values published in the open literature are very limited. Regarding the availability of solubility data for halogenated hydrocarbons in water, Beilstein (4th Supplementary Series, Volume 1/Part 1 (1958), Volume V/Part 1 (1963), and Volume V/Part 2 (1964)) cites 1369 compounds up to six carbon atoms, of which only 61 have information as to their solubility, mostly for a single temperature only."

In some cases methods have been developed for the prediction of a property. In other cases there has been no reported research or activity for predicting a particular property. Prediction should not be confused with prioritization. As Bill Milne at the NIH Cancer Institute will discuss in a later talk, it is often very useful to know the relative values of a given property. While this is definitely property estimation, it is not the sort of estimated values which would fill the "data gaps" in the Beilstein database. Also, in this matter of relative values of a given property, it could be harmful if non-experts see relative numbers and mistakenly take these values to be real and absolute numbers.

While predictive methods or procedures can be developed without data, they can only be tested if one has data. Such data must also have several attributes. First, they must be accurate and precise to test a given hypothesis on how to predict a numeric data value. Second, there needs to be a reasonable number of data points so that there is some "weight of evidence" to the prediction. While two points will mathematically give a straight line, in chemistry (as well as other disciplines) more data are required before the scientific community feels the predominance of evidence is correct. Third, the data should cover a broad range of chemical classes to have maximum predictive value. In preparing this presentation I searched the literature from 1967 to early 1988, using the Online Chemical Abstracts database, for property predictions. I also talked with some colleagues about this subject. If this literature search and these discussions were all one knew about organic chemistry, it would seem that the field was concerned only with hydrocarbons and simple mono-functional group compounds with fewer than 10-15 non-hydrogen atoms. Most predictive methods are useful as teaching examples, rather than being able to fill in the "data gaps" in the Beilstein Handbook or elsewhere. My research is concerned with the creation of a database of chemical, physical, and other properties of pesticides for use in models which will predict possible contamination of our nation's groundwater. With few exceptions (methyl bromide and 1,2-dibromoethane (EDB) being the only two I have found so far) most pesticides contain over 10 non-hydrogen atoms and 2-3 elements in addition to carbon and hydrogen. Thus, for this research to proceed, new or expanded predictive capabilities are needed. Before this can occur, at least in the pesticide chemistry field, one needs better data for the existing compounds. One should not try to extrapolate from a vacuum into the real world.

How does one choose which method for property prediction to use? Are the known methods valid for the entire range of organic chemistry? Of course not. Have the authors explicitly stated what the limitations of the methods are? What is the reliability of a given method? What are the error ranges for the predicted data? Scientists are notorious for reporting calculated numbers to beyond the range of significant figures. What is necessary to be sure predicted data are properly presented? How was the method developed? If the method is based on calculation of another piece of data for input, what is the reliability and error range associated with the input data?

One of the first and most important properties being collected for the Agricultural Research Service (ARS) database is the aqueous solubility of a pesticide. Highly soluble materials are rapidly distributed in the soil and can be easily transported to the water table. There are a number of methods for estimating solubility. Five basic methods are described by Lyman and his colleagues (5, Chapter 2, Table 2-1). However, most give an estimated value at only one temperature (25 °C), and "few have actually been presented (and tested) as predictive tools" (5, page 2-1). Furthermore, issues such as "relative merits, applicability, and accuracy" of these methods had not been reviewed prior to the work of Lyman and his colleagues (5, page 2-2). For example, the PC-GEMS program (6) first calculates an octanol/water partition coefficient (LogP) from a two dimensional structure input. It then calculates a melting point from the two dimensional structure and the calculated LogP value. From there the water solubility is calculated. As seen in Figure 3, the predicted water solubility is given to three significant figures, without any justification that the prediction is that accurate. Another estimation program, CHEMEST (7), which is based on work published in the book by Lyman and his colleagues (5), provides for calculations of 11 different properties using 36 different methods. This system, CHEMEST, fares much better with respect to providing information about the limits and accuracy of the methods and the calculation or estimation errors. Figure 4 shows the same calculation of solubility using the CHEMEST program. Both programs use the same procedures, but CHEMEST provides the user with information on the error associated with the method.
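To make the point about significant figures concrete, the following minimal Python sketch takes the solubility and total error printed in Figure 4 and works out the range they imply. It assumes that the CHEMEST notation "X 1.6" denotes a multiplicative uncertainty factor, so that the true value is expected to lie between the estimate divided by 1.6 and the estimate multiplied by 1.6. That reading, the function names, and the rule of thumb used in justified_digits are illustrative assumptions and are not taken from either program.

```python
import math


def error_bounds(estimate, factor):
    """Return (low, high) for an estimate with a multiplicative error factor."""
    if estimate <= 0 or factor < 1:
        raise ValueError("estimate must be positive and factor must be >= 1")
    return estimate / factor, estimate * factor


def round_sig(value, digits):
    """Round a nonzero value to the given number of significant figures."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


def justified_digits(factor):
    """Rule of thumb (an assumption): digits supported by an error factor > 1."""
    relative = factor - 1.0  # e.g. a factor of 1.6 is roughly a 60% uncertainty
    if relative <= 0:
        raise ValueError("factor must be greater than 1")
    return max(1, math.ceil(-math.log10(relative)))


estimate_mg_per_l = 8.59e4  # solubility printed in Figures 3 and 4
total_factor = 1.6          # "TOTAL ERROR : X 1.6" in Figure 4

low, high = error_bounds(estimate_mg_per_l, total_factor)
digits = justified_digits(total_factor)
rounded = round_sig(estimate_mg_per_l, digits)

print(f"reported : {estimate_mg_per_l:.2E} mg/L")
print(f"range    : {low:.1E} to {high:.1E} mg/L")
print(f"suggested: {rounded:.0E} mg/L ({digits} significant figure(s))")
```

Under these assumptions the implied range spans more than a factor of two, which is why a value quoted to three figures overstates what the method can support.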

Recently I was told (8) of a chemical company which decided to re-run some partition coefficient data for several acid anilide pesticides. The chemicals were from their own company, as well as from other manufacturers. They designed and ran the experiments very carefully, with the proper quality assurance and quality control. When they compared their results to the predicted values using the CLOGP program (9), there were sufficient differences to warrant some concern. As a result, they discussed the matter with the world's authority in the field, Professor Corwin Hansch. After seeing how the experimental data were collected and what values the CLOGP predictive program generated, the CLOGP program was revised by Hansch and his coworkers to take into account high quality experimental data.

How many and what type of compounds were used to test the validity of the method? How accurate were the data used to develop and/or test the predictive technique? How does one prove that the predictive method used is accurate? In a talk at this workshop Peter Jurs from Penn State University (USA) will describe some of his research activities in this field, including a recently reported study (10) on predicting olefin boiling points from molecular structure. Jurs wisely chose these compounds because there were a reasonably large number of compounds (123) available and the data were of high quality. The method he has developed for the class of compounds studied appears to have solved the problems encountered with earlier prediction techniques (7,11) which were not able to handle many isomeric compounds. In contrast, the data in the Beilstein Handbook, while evaluated, are collected in a random fashion in terms of a class or series of compounds. Also, with rare exceptions the data come from many sources, published over many years, and using many different experimental conditions. Thus, it may not be possible to obtain a large number of similar compounds with enough identical properties from the Beilstein database to assure that a predictive method is properly tested and evaluated. Unquestionably this is a handicap which must be overcome. Without sufficient and good experimental data, predictive methods must be viewed very carefully.

Assuming satisfactory answers to these questions, let us now proceed to the question of data quality or reliability. How should the predicted results be evaluated? Since all the experimental data which go into the Beilstein Handbook are evaluated, it is reasonable to assume that methods must be developed to evaluate the predicted data. What does one do in the case where two (or more) methods are believed to be scientifically valid, yet yield different answers? We are beginning to develop a series of expert systems for data evaluation. The process will be based on our SELEX expert system (12) which provides objective and consistent evaluations of published data on the selenium content in foods. Our first data property expert systems will be in the areas of solubility and vapor pressure evaluations.

Once agreement is reached on a number for inclusion in the Beilstein database, how will it be noted or tagged in the database? Will it be clear to the user that the information or numeric value is not an experimental value, but rather a predicted or calculated value? Certainly a clearly marked reference citation should suffice, but how can one be sure that an entry transferred from the Beilstein database to a report (or a value from the ARS Pesticide Property database used as input for some model or other purpose) is properly referenced and properly used? Should experimental data be in a separate section of the Beilstein database to help assure the user notices the difference? Should there be a notation in the record saying, for example, "No experimental data, please see predicted value given below"?

Should interpolated data be noted as such, as compared with extrapolated data? How will the evaluation criteria take such differences into account? What will happen when an experimental value for a particular parameter is found? Will the experimental value automatically replace the predicted value? What if the values are far apart? In some cases errors can be a few percent or less (for interpolation), but they can be orders of magnitude (for extrapolation). What might this imply about the method used for the prediction? Will there be a notation in the record that the newer, experimental value is a replacement value? Should the original predicted value be kept in the database?
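One way to record the interpolation/extrapolation distinction is to compare each query compound with the range of compounds used to develop the method. The Python sketch below does this for a single descriptor, the number of non-hydrogen atoms. The choice of descriptor, the function name, and the example counts are all hypothetical; a real method would need a more careful definition of the range over which it has been tested.

```python
def classify_prediction(training_values, query_value):
    """Label a prediction by whether the query lies inside the training range."""
    if not training_values:
        raise ValueError("training set is empty")
    low, high = min(training_values), max(training_values)
    if low <= query_value <= high:
        return "interpolated"
    return "extrapolated"


# Hypothetical non-hydrogen atom counts for the compounds used to fit a method.
training_heavy_atoms = [4, 5, 6, 7, 8, 9, 10]

# Hypothetical query compounds: one inside the tested range, one well outside it.
for name, heavy_atoms in [("small test compound", 6), ("typical pesticide", 18)]:
    label = classify_prediction(training_heavy_atoms, heavy_atoms)
    print(f"{name}: {heavy_atoms} non-hydrogen atoms -> {label}")
```

A label of this kind could be stored with each predicted value so that the evaluation criteria can treat the two cases differently.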

When an experimental value is found to be considerably different from the predicted value, what should be done about the predicted values for other, similar compounds in the database (or the other compounds in the database which have data values predicted by this method)? If the reliability of a method comes into question later, how easy will it be to change all the records in the database found in a number of online systems which use this method?
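The tagging and replacement questions raised above can be made concrete with a small data-structure sketch in Python. Everything in it is an illustrative assumption: the field names, the "experimental" and "predicted" labels, the method identifier, the numeric values, and the idea of moving a superseded prediction into a history list. It is not a description of the Beilstein or ARS database design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PropertyValue:
    value: float
    units: str
    origin: str                   # "experimental" or "predicted"
    method: Optional[str] = None  # estimation method identifier, if predicted
    suspect: bool = False         # set when the method's reliability is questioned


@dataclass
class PropertyRecord:
    compound: str
    prop: str
    current: PropertyValue
    history: List[PropertyValue] = field(default_factory=list)

    def supersede(self, new_value: PropertyValue) -> None:
        """Replace the current value but keep the old one for later comparison."""
        self.history.append(self.current)
        self.current = new_value


def flag_method(records, method):
    """Mark every current predicted value produced by `method` as suspect."""
    flagged = []
    for record in records:
        value = record.current
        if value.origin == "predicted" and value.method == method:
            value.suspect = True
            flagged.append(record)
    return flagged


# Hypothetical records; the numbers are placeholders, not real data.
records = [
    PropertyRecord("compound A", "water solubility",
                   PropertyValue(5.0e3, "mg/L", "predicted", "method-1")),
    PropertyRecord("compound B", "water solubility",
                   PropertyValue(1.2e2, "mg/L", "predicted", "method-1")),
]

# An experimental value arrives for compound A; the prediction stays in history.
records[0].supersede(PropertyValue(7.0e2, "mg/L", "experimental"))

# The method is later questioned: only values still resting on it are flagged.
for record in flag_method(records, "method-1"):
    print(record.compound, record.current.value, record.current.units, "needs review")
```

Keeping the method identifier with every predicted value is what makes the last step cheap: all records that depend on a questioned method can be found and regenerated without inspecting each entry by hand.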

One should also ask if the method is automated. If not, can it be automated? For a method to have any possible practical application for the Beilstein Institute, and be used with such a large database as the Beilstein Structure Registry Connection Table database, computerization is essential. Is the two-dimensional structure sufficient for input into the prediction method?

Responsibility

Who is responsible for the predicted data? When an error is found, should it or must it be quickly corrected? Certainly a computer program can regenerate a large set of predicted values in a very short time. Can this be done as a practical matter and will this be done even if it is costly? If it is done, what guarantee is there that the online vendors of the Beilstein database will quickly replace the older or incorrect data with the corrected or new data? Dealing with scientific data implies a greater responsibility than is normally taken with bibliographic abstracts.

How will the scientific community accept a database with many entries of predicted values? How will Government agencies in the US, Europe, and elsewhere accept such data? What effect will such data have on patents and patent rights? Will predictions be considered, under any conditions, as "prior art"?

Proposal

Now that the less positive aspects of property prediction have been raised, I would like to propose some possibilities for future work. The research falls into two distinct areas. The first is the creation of collections of high quality databases for a series of classes of multi-functional group compounds. This is essentially what we are doing with the ARS Pesticide Database, since accurate values for parameters such as solubility and vapor pressure do not exist for most pesticides. Results from solubility experiments run under conditions of different temperature, pH, and ionic strength will give us the necessary data input for the wide range of agricultural conditions which exist. We then hope to use these results as the foundation for developing accurate predictive methods. In other areas, such as bio-medicine and pharmaceutical chemistry, a parameter like the octanol-water partition coefficient (LogP) may be considered to be of high priority and importance. LogP also has been proposed to be of potential value in the prediction of aqueous solubility (4,9).

I would hope that the Beilstein Institute, with the support of the German Government and others, would fund several such projects as prototypes to see how useful some high quality data for a number of parameters will be in creating as broad a predictive strategy as possible.

Conclusion

This presentation discusses several difficult issues associated with the wide-scale use of predicted property data for a large and chemically diverse database. While the overall state-of-the-art is in its infancy, and is quite limited in its current application, this workshop has taken the first and bold step in looking into the question which Clemens Jochum asked in his October 1986 letter of invitation to all the workshop attendees: "Is it possible to fill the data gaps in the millions of Small Information Compounds in the Beilstein database?" As we hear the many excellent research activities described in lectures by experts in their fields over the next three days, I hope that we will all remember to ask some of the questions and address some of the issues mentioned in this presentation, so that the goal of the Beilstein Institute can be reached.

Acknowledgements

I would like to thank my colleagues at ARS, D. Bigwood, S. Rawlins, D. Wauchope, and C. Helling for their valuable suggestions. I would also like to thank D. Lide and L. Gevantman (NBS) and G. W. A. Milne (NIH) for their insightful comments and thoughts on numeric data, data evaluation, and data quality. Lastly, I would like to thank my son Matt for his contribution of suggesting the Mendeleev prediction of new elements while we were studying for one of his high-school chemistry tests.

REFERENCES

1. Dr. Pangloss in Chapter 1 of Candide, Voltaire (1759).

2. Y. Wolman, "Chemical Information - A Practical Guide to Utilization", 2nd ed., J. Wiley & Sons, New York (1988).

3. D. Mendeleev, Ann., Suppl. VIII, 133-229 (1871).

4. A. L. Horvath, "Halogenated Hydrocarbons", M. Dekker, New York (1982).

5. W. J. Lyman, W. F. Reehl, and D. Rosenblatt, "Handbook of Chemical Property Estimation Methods", McGraw-Hill, New York (1982). This book is, at present, out of print.

6. PC-GEMS (Personal Computer version of the Graphical Exposure Modeling Program) is available from Ms. Cathy Turner, US Environmental Protection Agency, TS-798, Washington, DC 20460 USA. The software will be provided free of charge so long as one sends a sufficient number of formatted 360K or 1.2 MB 5 1/4 inch floppy disks. The 1986 manual (Publication # SGC-TR-13-88-003) is also available at no charge.

7. The IBM PC version of CHEMEST is available for $585 from TDS (Technical Database Services) Inc., 10 Columbus Circle, New York, NY 10019 (Telex: 6714962).

8. D. Gustafson, Monsanto Chemical Co., St. Louis, MO 63198, private communication.

9. MedChem Project, Chemistry Department, Pomona College, Claremont, CA 91711.

10. P. J. Hansen and P. C. Jurs, Anal. Chem., 59, 2322-2327 (1987).

11. R. C. Reid, J. M. Prausnitz, and T. K. Sherwood, "The Properties of Gases and Liquids", 3rd ed., McGraw-Hill, New York (1977).

12. D. W. Bigwood, S. R. Heller, W. R. Wolf, A. Schubert, and J. M. Holden, Anal. Chim. Acta, 200, 411-419 (1987).

Figure 1: Mendeleev 1871 Property Prediction of Three Elements

Property                           Prediction     Determination
--------                           ----------     -------------

Eka*-Aluminum                                     Gallium (Discovered in 1875)
  Atomic Weight                    68             69.9
  Specific Weight                  6.0            5.96
  Atomic Volume                    11.5           11.7

Eka-Boron                                         Scandium (Discovered in 1879)
  Atomic Weight                    44             43.79
  Oxide                            Eb₂O₃          Sc₂O₃
  Specific Weight (Oxide)          3.5            3.864
  Sulphate                         Eb₂(SO₄)₃      Sc₂(SO₄)₃

Eka-Silicon                                       Germanium (Discovered in 1886)
  Atomic Weight                    72             72.3
  Specific Weight                  5.5            5.469
  Atomic Volume                    13             13.2
  Oxide                            EsO₂           GeO₂
  Specific Weight (Oxide)          4.7            4.703
  Chloride                         EsCl₄          GeCl₄
  Boiling Point (Chloride)         < 100 °C       86 °C
  Density (Chloride)               1.9            1.887
  Ethyl Compound                   EsAe₄          Ge(C₂H₅O)₄
  Boiling Point (Ethyl Compound)   160 °C         160 °C

* Eka is the Sanskrit prefix for the number one

Figure 2: Examples of Data Elements from the Beilstein Factual Database

Mnemonic   Name
--------   ----
ATC        Atom Count
BF         Biological Function
BP         Boiling Point
BRN        Beilstein Registry Number
CCOL       Crystal Color
CDEN       Crystal Density
CN         Chemical Name
CRP        Critical Pressure
CRT        Critical Temperature
CRV        Critical Volume
DEN        Density
DM         Dipole Moment
ECOL       Ecological Data
ED         Entry Data
ELC        Element Count
ELS        Element Symbol
ENTR       Entropy
FW         Formula (Molecular) Weight
HFOR       Energy of Formation
HFUS       Enthalpy of Fusion
HSUB       Enthalpy of Sublimation
IP         Ionization Potential
IRS        Infrared (IR) Spectrum
LW         Lawson (Classification Scheme) Number
MF         Molecular Formula
MI         Moment of Inertia
MP         Melting Point
MS         Mass Spectrum
NMRS       NMR Spectrum
OA         Optical Anisotropy
ORD        Optical Rotatory Dispersion
PHWP       Polarographic Half-Wave Potential
PRE        Preparation
QM         Quadrupole Moment
RAS        Raman Spectrum
REA        Reaction
RN         CAS Registry Number
SFOR       Entropy of Formation
SLB        Solubility
SO         Beilstein Handbook Source Citation
ST         Surface Tension
SY         Synonym
TOX        Toxicity
TP         Triple Point
UP         Update Date
USE        Use
VP         Vapor Pressure

Figure 3: Sample Property Estimation From PC-GEMS Program

21:40:01 Sunday 05/01/88

PHYSICO-CHEMICAL PROPERTIES
---------------------------
Smiles Notation          = CCCO
Chemical Name            = Propanol
Molecular Formula        = C3H8O                   Calc. from Smiles
Molecular Weight         = 60.10                   Calc. from Smiles
Physical State *         = Liquid                  User Entered
LogKow                   = 2.9 -01                 User Entered
Water Solubility         = 8.59E+04 mg/L           Equation 13N
Melting Point            = -8.5E+01 (C)            Grain and Lyman
Vapor Pressure           = 48.33 mm at 25.00(C)    Antoine
Boiling Point            = 82.33 (C)               Meissner
Henry's Law Constant     = 4.89E-05 atm m3/mol     Method 1
Bio Concentration Factor = 9.78E-01                Kow (Method 1)
Adsorption Coefficient   = 1.00                    Kow, Eqn. 4-10

* Estimated MP or BP does not change entered physical state.

Press any key to continue

(The output shown is exactly as it appeared on the computer monitor.)

Figure 4: Sample Property Estimation From CHEMEST Program

*********************************************************
*                                                       *
*   CHEMEST ............ CHEMICAL PROPERTY ESTIMATION   *
*                                                       *
*   FILE: ITALY.TST   DATE: 1-May-88   TIME: 21:44:28   *
*                                                       *
*********************************************************

CHEMICAL NAME/IDENTIFICATION ... Propanol
============================

WATER SOLUBILITY ESTIMATION:
---------------------------
SOLUBILITY             : 8.59E+04 MG/L

ESTIMATION ERROR:
----------------
METHOD ERROR           : X 1.6
PROPAGATED ERROR       : X 1.0
TOTAL ERROR            : X 1.6

METHOD IDENTIFICATION:
---------------------
METHOD USED            : 1
EQUATION USED          : 13 in Reference 15

KEY INPUT:
-------
ACID GROUP IN CMPD.?   : NO
OCTANOL-WATER PRT. CF. : 0.290 L
PHYSICAL STATE AT 25 C : L

(The output shown is exactly as it appeared on the computer monitor.)