1
Chemical Information – 30 Years of Progress
  • Stephen R. Heller
  • Consultant/Guest Researcher
  • NIST/PCPD
  • Gaithersburg, MD 20899


  • steve@hellers.com
2
The slides from this presentation can be found at:

http://www.hellers.com/steve/pub-talks/
3
A few caveats ….
4
No animals were harmed in the
preparation of this talk,
however a number of WWW sites were hit.
5
I am very well balanced. I have a chip on both shoulders.
6
The examples shown here are arbitrary and are by no means a complete picture of what is available in the area of chemical information.
7
Three main factors that will be mentioned when discussing software and databases:
  • Technical
  • Economic
  • Political/Cultural
8
Technical Issues/Factors
  • Development of information, databases, and data structures (i.e., formats).


  • What information to put into a database and in what form.





9
Economic Issues/Factors
  • What will it cost? Who will pay what to recover costs and make a profit?


  • Is it a profit making company, non-profit, professional society?


  • Run an organization as a business or in a business-like manner?


  • The US Government decides on winning products. (e.g., mass spec database)


10
Political/Cultural Issues/Factors
  • Does an organization provide what the user wants and needs? What will provide the maximum income now and into the future for the database producer?


  • E.g., CAS Registry numbers are available in so few databases for political and economic reasons.


  • The US Government decides on winning products. (e.g., mass spec database).


11
Lecture Outline
  • Computer Hardware/Software
  • Publications – Bibliographic Information – Journals, Books, Patents
  • Numeric Data
  • Chemical Structures
  • Internet Resources
  • Future Needs
12



There are three ways to ruin yourself:
gambling, women, and technology.
  • Gambling is the fastest
  • Women are the most pleasurable
  • Technology is the most certain


  • Georges Pompidou


13
Some observations about hardware and software over the past 30 years.
14
If software is so user friendly, why are there so many training classes?
15
Software is like entropy:

It is difficult to grasp, weighs nothing, and obeys the Second Law of Thermodynamics (it continues to increase).
16
Computer Trends

Hardware:
Smaller
Faster
Cheaper

Software:
Bigger
Slower
More Expensive
17
A common feature of all such systems is that they are different.
18
With my new PC I can:
  • 1. Edit text like my secretary
  • 2. Calculate like our bookkeeper
  • 3. File like the office clerk
  • 4. Communicate like the mail room
  • 5. Draw like the art department
19
Publications
20
Journals were mainly published by professional societies, which originally put their members’ needs first.

Today virtually all chemistry publishers are businesses. The needs of “rich” members come first.
21
ACS/CAS Salaries
  • Name                      1993                    2001
  • John Crum           $235,640           $676,840
  • Bob Massie          $212,500           $590,916
  • Bob Marks            $156,488
  • Bob Bovenschulte                          $482,877
  • Source: IRS 990 tax returns, see www.idontcare.com/acs



22

                    1778

Chemisches Journal, thought to be the first chemical journal, is established by Lorenz von Crell. Published 1778-84, subsequently renamed Chemische Annalen and published 1784-1803. It already included some abstracts.
23
Journals
  • 1960
  • All print journals
  • Since the 18th century the only way to distribute information to chemists around the world


  • 2003
  • Most chemistry journals available electronically
  • Print/snail mail – replaced by the Internet


24
Most scientific manuscripts are write-only –

so while you get what you need, you still pay for everything
25
1960’s

CAS realizes it will take 6 years to manually produce the CAS 5 Year Index. The need to keep track of chemical name index entries is the basis for the CAS Registry System. The confusion with a chemical having multiple names is the basis for the CAS Structure File. In the late 1960's CAS had no idea what to do with the structure file, then 1.6 million compounds, and gave the file to NIH/DCRT to see what a research computer center might do with the database. This led to the NIH Substructure Search System, and was the first and last time CAS gave the file away.
26
1960’s

The chemist would read the CAS sections appropriate to their research needs.  Then he/she would go to the library to read the full journal article of interest.  Often this meant a request for an interlibrary loan to obtain the article.
27
2003

The chemist logs onto CAS/SciFinder, ISI Web of Science, ScienceDirect, Scirus.com, Chemindustry.com, or Chemweb.com to search for something of interest. Then he/she clicks on the hyperlink, using LitLink or ChemPort, and, assuming paid-for access to the journal article, the article appears immediately on the computer screen to read or to print out and take to the bathroom to read. Now document delivery is easy and fast. More importantly, one learns from the experiences of others: being able to do computer searches of the literature helps a lot and allows one to read more articles of interest.
28
In the past 30 years chemists have become more efficient owing to easier and more complete literature and patent searching. Rapid/instant document delivery has made information much more accessible, assuming you can afford it. An e-journal is a more valuable product!
29
Current Awareness will become
Continuous Awareness
30
The Paperless Office is as practical as the Paperless Bathroom –

The cost of printing an e-journal has moved to the user.
31
Patents
  • Patent offices around the world have moved to providing patent information in computer readable form in the past decade.


  • Companies such as Derwent, CAS, Inpadoc, IFI, and Prous have added value to the raw patent information.


  • The US and European Patent Offices now provide electronic access.
32
Internet Resources
33
The Internet is like a box of chocolates -
you never know what you will get.

(with apologies to Forrest Gump)
34
The Internet is like drinking from a fire hydrant
35
Chemistry Portals
  • Functionalities:


  • Search Engine
  • Linking
  • Directories
36
Portals/Vortals
(Vertical Portals)
  • Openings – Doorways, Gateways, Outlet, Entry to other places


  • ChemGuide
  • ChemWeb
  • Chemindustry.com
  • Scirus
  • DiscoveryGate (fee based)


45
Chemical Structures
46
The nice thing about standards is
that there are so many of them.
47
Chemical Structure Representations
  • Wiswesser Line Notation (WLN)
  • CAS Registry III Connection Tables
  • CAS CXF (Chemical eXchange Format)
  • MDL MOLFile
  • Standard Molecular Data (SMD)
  • ROSDAL
  • SMILES
  • IUPAC IChI


50
What kind of Identifier is needed?
  • Exactly one Identifier per structure
  • Defined by algorithms


  • Comprehensive


  • Openly available


  • Implemented



51
REQUIREMENTS
 Different compounds have different identifiers
All distinguishing structural information is included
52
REQUIREMENTS

One compound has only one identifier
No unnecessary information is included
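
Taken together, the two requirements mean the identifier must not depend on how the atoms happen to be numbered in a drawing. The Python sketch below is a toy, brute-force illustration of that idea only. It is not the IChI algorithm; the function name `toy_identifier` and its tuple output are invented for this example, and trying every renumbering is feasible only for a handful of atoms.

```python
from itertools import permutations


def toy_identifier(elements, bonds):
    """Return a numbering-independent label for a small molecular graph.

    elements: list of element symbols, one per atom (hydrogens left out).
    bonds: list of (i, j) pairs of 0-based atom indices.
    Illustrative only: tries every renumbering and keeps the smallest result.
    """
    n = len(elements)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new number given to atom `old`.
        relabelled = [None] * n
        for old, new in enumerate(perm):
            relabelled[new] = elements[old]
        edges = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        candidate = (tuple(relabelled), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best


# Ethanol drawn with two different atom numberings, and dimethyl ether.
ethanol_a = toy_identifier(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = toy_identifier(["O", "C", "C"], [(0, 1), (1, 2)])
dimethyl_ether = toy_identifier(["C", "O", "C"], [(0, 1), (1, 2)])

print(ethanol_a == ethanol_b)        # one compound, one identifier
print(ethanol_a == dimethyl_ether)   # different compounds, different identifiers
```

Both ethanol drawings give the same label, while dimethyl ether, which has the same formula, gives a different one.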
53
Divide into ‘Layers’
  • Separate ‘Name’ into Fragments by
    • Connectivity
    • Isotopes
    • Stereochemistry
    • Tautomerism
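
A rough way to picture the layering, sketched in Python: each layer is stored as its own fragment, so two structures can be compared at whatever level of detail a search needs. The class name, the helper `same_up_to`, the layer strings, and the assumption that layers are compared in the order listed above are all invented for illustration and are not IChI output.

```python
from dataclasses import dataclass

# Layer order as listed on the slide; treating it as a comparison order
# is an assumption made for this sketch.
LAYER_ORDER = ("connectivity", "isotopes", "stereochemistry", "tautomerism")


@dataclass(frozen=True)
class LayeredName:
    connectivity: str
    isotopes: str = ""
    stereochemistry: str = ""
    tautomerism: str = ""

    def layers(self, upto="tautomerism"):
        """Return the fragments from connectivity through `upto`, in order."""
        stop = LAYER_ORDER.index(upto) + 1
        return tuple(getattr(self, name) for name in LAYER_ORDER[:stop])


def same_up_to(a, b, upto):
    """True if two names agree on every layer up to and including `upto`."""
    return a.layers(upto) == b.layers(upto)


# Placeholder fragments for two mirror-image forms of one skeleton.
left_form = LayeredName(connectivity="skeleton-A", stereochemistry="left")
right_form = LayeredName(connectivity="skeleton-A", stereochemistry="right")

print(same_up_to(left_form, right_form, "connectivity"))     # same skeleton
print(same_up_to(left_form, right_form, "stereochemistry"))  # differ here
```

A search that ignores stereochemistry compares only the first layers; a stricter search compares them all.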

54
Possible Output Format
  • Example: Benzene
  •  <structure number="1" id.name="" id.value="">
  •  <identifier version="0.93Beta" tautomeric="0">
  •    <basic>C6H6,1H-2H-4H-6H-5H-3H-1</basic>
  •    <charge></charge>
  •   </identifier>


  •  <identifier.auxiliary-info version="0.93Beta" tautomeric="0">
  •    <!-- Auxiliary info is not a part of the identifier, it is not unique -->
  •    <atom.orig-nbr>1,2,6,3,5,4</atom.orig-nbr>
  •    <atom.equivalence>(1,2,3,4,5,6)</atom.equivalence>
  •   </identifier.auxiliary-info>
  •  </structure>
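
As a hedged illustration of how a program might consume output like the benzene example, the Python sketch below reads the XML with the standard library. The element and attribute names are copied from the slide; the helper names, and the reading of the `<basic>` text as "formula, then connectivity" split at the first comma, are assumptions based only on this one example.

```python
import xml.etree.ElementTree as ET

BENZENE_XML = """\
<structure number="1" id.name="" id.value="">
 <identifier version="0.93Beta" tautomeric="0">
   <basic>C6H6,1H-2H-4H-6H-5H-3H-1</basic>
   <charge></charge>
 </identifier>
 <identifier.auxiliary-info version="0.93Beta" tautomeric="0">
   <!-- Auxiliary info is not a part of the identifier, it is not unique -->
   <atom.orig-nbr>1,2,6,3,5,4</atom.orig-nbr>
   <atom.equivalence>(1,2,3,4,5,6)</atom.equivalence>
 </identifier.auxiliary-info>
</structure>
"""


def child(parent, tag):
    """Return the first direct child element with exactly this tag name."""
    for element in parent:
        if element.tag == tag:
            return element
    raise KeyError(tag)


def read_identifier(xml_text):
    """Pull the identifier and auxiliary fields out of one <structure>."""
    root = ET.fromstring(xml_text)
    identifier = child(root, "identifier")
    basic = child(identifier, "basic").text
    # Assumption: formula and connectivity are separated by the first comma.
    formula, _, connectivity = basic.partition(",")
    aux = child(root, "identifier.auxiliary-info")
    original_numbers = [int(n) for n in child(aux, "atom.orig-nbr").text.split(",")]
    return {
        "version": identifier.get("version"),
        "formula": formula,
        "connectivity": connectivity,
        "charge": child(identifier, "charge").text or "",
        "original_atom_numbers": original_numbers,
    }


benzene = read_identifier(BENZENE_XML)
print(benzene["formula"], benzene["connectivity"])
```

Only the `<identifier>` element would be used for matching; as the comment in the output says, the auxiliary block is not part of the identifier.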


55
Future Extensions
  • Other Stereo Forms
    • Non-atom centered
    • Conformations
    • Hydrogen Bonding
  • Polymers/Macromolecules
  • Salts, Alloys
  • Organometallics
  • Mixtures
  • Compound Classes
    • Markush structures

56
IChI software available from
Steve Stein:

steve.stein@nist.gov
57
Numeric Data
58
Biology vs. Chemistry data – an interesting cultural difference
 
Bioinformatics data is the antithesis of the chemical data franchise.

Virtually all free vs. virtually all for fee.
59
Reference Works
  • 1970’s
  • Almost all in print
  • Outright, permanent purchase of work with one-time money
  • 2000’s
  • All digital
  • Cannot buy anything, must lease year-to-year with yearly money
60
Numeric Databases
  • 1970
  • Very few databases in electronic form
  • Print-Libraries-Handbooks are the norm


  • 2003
  • No one would think of anything but an electronic database


61
Mass Spectrometry
  • 1970
  • First NIH database: 8,124 spectra, including duplicates
  • Very limited Quality Control


  • 2003
  • Current 2002 NIST database:
  • 174,948 total spectra
  • 147,198 unique spectra
  • Extensive Quality Control


62
Database Problems (1)
  • 1970’s
  • Too few CAS #’s and linking
  • Lack of coding/classification standards (different company names, therapeutic use of drugs, etc.)
  • 2000’s
  • Too few CAS #’s and linking
  • Lack of coding/classification standards (different company names, therapeutic use of drugs, etc.)




63
Database Problems (2)
  • 1970’s
  • Quality Control is time consuming, but can be affordable
  • Databases don’t have CAS#’s
  • 2000’s
  • Quality Control is time consuming, and seems less and less affordable
  • 10% of a database has superseded CAS #’s
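
One inexpensive quality-control step for CAS numbers can be automated: the last digit of a CAS Registry Number is a check digit computed from the digits before it. The Python sketch below tests that digit; the function name is made up for this example. It catches typing and transcription errors only, and cannot tell whether a number has been superseded or is attached to the wrong compound.

```python
import re

# CAS Registry Number layout: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_number):
    """Return True if the string is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas_number.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... counting from the right-most body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check


print(cas_check_digit_ok("7732-18-5"))  # water
print(cas_check_digit_ok("71-43-2"))    # benzene
print(cas_check_digit_ok("7732-18-4"))  # last digit mistyped
```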
64
Database Problems - NCI (1)
  • 1970’s
  • NCI drug database – 300,000 compounds
  • No quality control on structures – just assume all are correct
  • 2000’s
  • NCI drug database – 500,000 compounds
  • A promising compound leads to discovery of MANY, MANY bad structures
65
Database Problems –  NCI (2)
  • On one hand, why does it matter if a structure is incorrect as long as it is an effective drug?


  • On the other hand, if one wants to find a similar, more effective, less toxic drug, it would be nice to know what the correct structure is.
66
Multiple Database Access
  • 1970
  • NIH/EPA CIS
  • Structure Searching linked to data and information
  • 200,000 structures
  • ~ 1 million facts
  • Searches only CIS databases
  • Links are all “hard-wired” – same cpu


  • 2003
  • MDL DiscoveryGate
  • Structure searching linked to data, literature, and information
  • 11 million structures
  • 200 million facts
  • Searches only MDL databases
  • Some links are now web based hyperlinks


68
Multiple Database Access - 2003
  • CIS – Not possible to search in-house and public databases; No cooperation with ACS/CAS. All databases on same computer.


  • DiscoveryGate - Not possible to search in-house and public databases; No cooperation with ACS/CAS. Databases (e.g., journal articles) hard-linked and hyper-linked.


69
CIS vs. DiscoveryGate – in 30 years the databases are bigger, much more extensive, and link to databases on other computer systems, but the functionality is just about the same.

PS. The CIS was not commercially viable and is no longer available, so size counts.  The success of DiscoveryGate, a newly released product, remains to be seen.
70
Vendor Revenues
  • Microsoft - $28.3 billion (company started in 1975)
  • Elsevier Science - $1.2 billion (excludes textbooks)
  • Kluwer - $700 million (entire health division)
  • ISI/Derwent - $250 million (started in 1958/1951)
  • CAS - $180 million (started in 1907)
  • MDL - $100 million (started in 1977)
  • Accelrys - $100 million (too many parts to date)
  • ACS Publications - $85 million (started in 1879)
  • Tripos - $50 million (started in 1979)
  • Daylight - $15 million (started in 1987)
  • CambridgeSoft - $10 million (started in 1983)
  • ACD/Labs – $10 million (started in 1994)


71
Future Needs
72
How well one creates, manages, distributes, and uses information is critical for the future of any organization
73
Successful information companies in the 2000's will
provide information:

When  it is needed
Where it is needed
How   it is needed
74
Where to get help
  • Chemtelligence Partners - consulting services in the strategic deployment and use of information-based tools in support of life sciences research. (www.chemtelligence.com)


  • Kilmorie - helps organizations to make the best use of their information resources, develop new information services or products, and bring them to market. (www.kilmorie.com)
75
In biotechnology the only -omics that really counts is economics


Sydney Brenner, 2002 Nobel Laureate
76
Pharma Industry Problem
  • Over the past 20 years the number of NME’s - New Molecular Entities or new drugs - has stayed level or decreased while the cost of developing a new drug has doubled and tripled. All this occurred in spite of the advances in chemistry and chemical information. This is UNACCEPTABLE!
79
Future Needs
  • Electronic Notebooks
  • Laboratory Information Management Systems (LIMS)
  • ADMET (accurate) Predictions
  • “Integration”
  • “Compatibility”
  • “Interoperability”
  • “Standardization of Information”
  • “Knowledge Management”
  • Knowledge not Information
  • More Linux, less Microsoft
80
Electronic Notebooks/LIMS
  • Allow potentially easy access to information and data – both text and graphics – for processing and/or analysis
  • Save time/money
  • Will be needed because regulatory organizations, such as the FDA (21 CFR Part 11) and the US Patent Office, say they want it


81
The most direct way of making such an attempt would obviously be to compare physiological action and chemical constitution in a sufficiently large number of cases, and by classifying the results to deduce a law; but unfortunately, “the data which we possess are quite insufficient for this.”


Brown & Fraser, Trans. Roy. Soc. Edinburgh, 25, 151 (1869)
82
ADMET (1)

  • The need to know the pharmacokinetic properties (Absorption, Distribution, Metabolism, Excretion, and Toxicity) of a possible drug as early as possible, so that expensive clinical studies are not done if there is a problem, is a major concern to the pharma industry: the need to “fail early”.
  • Despite rigorous in vitro and in vivo toxicity studies at the pre-clinical stage, only one out of 10 NCEs (new chemical entities) survives three phases of clinical trials.



83
ADMET (2)
  • Will be needed because the FDA says so


  • ADMET-1 – San Diego, February 11-13, 2004 see: www.scherago.com/admet



84
For a talk to be immortal
it does not have to be eternal.

Muriel Humphrey comment to
Vice-President Hubert Humphrey
85
Summary (1)
  • Over the past 30 years the quantity of chemical information available has increased dramatically.  Quality is still an issue.  There is still not enough data. Content is still king.


  • Interoperability – standardization – integration - is desired by users, but disliked by producers.


  • Print information is a one-time purchase, while electronic information is generally available as lease-only.
86
Summary (2)
  • Searching has gone from slow (300 baud) speed searching by information professionals to high speed (Internet T3) searching by end-users.


  • Databases have gone from primarily text to value-added indexing, coding, structures, and linking.
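
To put the two speeds side by side, the back-of-the-envelope Python sketch below compares a 300 baud modem, taken here as roughly 300 bit/s, with the nominal T3 line rate of 44.736 Mbit/s. The 1 MB article size is a hypothetical figure chosen for illustration, and the calculation ignores protocol overhead and shared bandwidth.

```python
BITS_PER_BYTE = 8
MODEM_BPS = 300          # 300 baud modem, assumed to carry about 300 bit/s
T3_BPS = 44_736_000      # nominal T3 (DS3) line rate in bit/s


def transfer_seconds(size_bytes, bits_per_second):
    """Idealized transfer time with no overhead or contention."""
    return size_bytes * BITS_PER_BYTE / bits_per_second


article_bytes = 1_000_000  # hypothetical 1 MB article
slow_seconds = transfer_seconds(article_bytes, MODEM_BPS)
fast_seconds = transfer_seconds(article_bytes, T3_BPS)
speedup = T3_BPS / MODEM_BPS

print(f"300 baud: {slow_seconds / 3600:.1f} hours")
print(f"T3:       {fast_seconds:.2f} seconds")
print(f"Raw line-rate ratio: about {speedup:,.0f} to 1")
```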
87
Summary (3)
  • Two major areas of chemical information software and database development will be ADMET and e-notebooks/LIMS/e-data integration.
88
If I haven't stepped on some toes, I'll try again.