|
1
|
- Stephen R. Heller
- Consultant/Guest Researcher
- NIST/PCPD
- Gaithersburg, MD 20899
- steve@hellers.com
|
|
2
|
|
|
3
|
|
|
4
|
|
|
5
|
|
|
6
|
|
|
7
|
- Technical
- Economic
- Political/Cultural
|
|
8
|
- Development of information, databases, and data structures (i.e.,
formats).
- What information to put into a database and in what form.
|
|
9
|
- What will it cost? Who will pay what to recover costs and make a profit?
- Is it a for-profit company, a non-profit, or a professional society?
- Run an organization as a business or in a businesslike manner.
- The US Government decides on winning products. (e.g., mass spec
database)
|
|
10
|
- Does an organization provide what the user wants and needs? What will
provide the maximum income now and into the future for the database
producer?
- E.g., CAS Registry Numbers are available in so few databases for political
and economic reasons.
- The US Government decides on winning products. (e.g., mass spec
database).
|
|
11
|
- Computer Hardware/Software
- Publications – Bibliographic Information – Journals, Books, Patents
- Numeric Data
- Chemical Structures
- Internet Resources
- Future Needs
|
|
12
|
- Gambling is the fastest
- Women are the most pleasurable
- Technology is the most certain
- Georges Pompidou
|
|
13
|
|
|
14
|
|
|
15
|
|
|
16
|
|
|
17
|
|
|
18
|
- 1. Edit text like my secretary
- 2. Calculate like our bookkeeper
- 3. File like the office clerk
- 4. Communicate like the mail room
- 5. Draw like the art department
|
|
19
|
|
|
20
|
|
|
21
|
- Name: 1993; 2001
- John Crum: $235,640; $676,840
- Bob Massie: $212,500; $590,916
- Bob Marks: $156,488
- Bob Bovenschulte: $482,877
- Source: IRS 990 tax returns; see www.idontcare.com/acs
|
|
22
|
|
|
23
|
- 1960
- All print journals
- Since the 18th century the only way to distribute information
to chemists around the world
- 2003
- Most chemistry journals available electronically
- Print/snail mail – replaced by the Internet
|
|
24
|
|
|
25
|
|
|
26
|
|
|
27
|
|
|
28
|
|
|
29
|
|
|
30
|
|
|
31
|
- Patent offices around the world have moved to providing patent
information in computer readable form in the past decade.
- Companies, such as Derwent, CAS, Inpadoc, IFI, and Prous have added
value to the raw patent information.
- The US and European Patent Offices now provide electronic access.
|
|
32
|
|
|
33
|
|
|
34
|
|
|
35
|
- Functionalities:
- Search Engine
- Linking
- Directories
|
|
36
|
- Openings – Doorways, Gateways, Outlet, Entry to other places
- ChemGuide
- ChemWeb
- Chemindustry.com
- Scirus
- DiscoveryGate (fee based)
|
|
37
|
|
|
38
|
|
|
39
|
|
|
40
|
|
|
41
|
|
|
42
|
|
|
43
|
|
|
44
|
|
|
45
|
|
|
46
|
|
|
47
|
- Wiswesser Line Notation (WLN)
- CAS Registry III Connection Tables
- CAS CXF (Chemical eXchange Format)
- MDL MOLFile
- Standard Molecular Data (SMD)
- ROSDAL
- SMILES
- IUPAC IChI
|
|
48
|
|
|
49
|
|
|
50
|
- Exactly one Identifier per structure
- Defined by algorithms
- Comprehensive
- Openly available
- Implemented
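
The first two requirements (one identifier per structure, defined by an algorithm) can be illustrated with a toy canonicalization. This is only a sketch under stated assumptions: the function name `canonical_key` and the brute-force approach are hypothetical and are not the IChI algorithm. It tries every atom numbering and keeps the smallest resulting description, so two drawings of the same structure yield the same key regardless of how the atoms were numbered.

```python
from itertools import permutations


def canonical_key(elements, bonds):
    """Return a numbering-independent key for a small molecular graph.

    elements: list of element symbols, indexed by atom number (hypothetical input format)
    bonds:    list of (atom_a, atom_b) index pairs
    Brute force over all numberings; only practical for a handful of atoms.
    """
    n = len(elements)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] gives the new index of each atom
        renumbered = [None] * n
        for old, new in enumerate(perm):
            renumbered[new] = elements[old]
        new_bonds = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        key = (tuple(renumbered), tuple(new_bonds))
        if best is None or key < best:
            best = key
    return best


# Ethanol heavy atoms, numbered two different ways, and dimethyl ether
ethanol_a = canonical_key(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_key(["O", "C", "C"], [(0, 1), (1, 2)])
ether = canonical_key(["C", "O", "C"], [(0, 1), (1, 2)])

print(ethanol_a == ethanol_b)  # True: same structure, same key
print(ethanol_a == ether)      # False: different connectivity
```

A production identifier replaces the factorial search with an efficient canonical labeling procedure, but the goal is the same: the identifier depends on the structure, not on the input numbering.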
|
|
51
|
|
|
52
|
|
|
53
|
- Separate ‘Name’ into Fragments by
- Connectivity
- Isotopes
- Stereochemistry
- Tautomerism
|
|
54
|
- Example: Benzene
    <structure number="1" id.name="" id.value="">
      <identifier version="0.93Beta" tautomeric="0">
        <basic>C6H6,1H-2H-4H-6H-5H-3H-1</basic>
        <charge></charge>
      </identifier>
      <identifier.auxiliary-info version="0.93Beta" tautomeric="0">
        <!-- Auxiliary info is not a part of the identifier, it is not unique -->
        <atom.orig-nbr>1,2,6,3,5,4</atom.orig-nbr>
        <atom.equivalence>(1,2,3,4,5,6)</atom.equivalence>
      </identifier.auxiliary-info>
    </structure>
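
As a rough illustration of how software might consume such a record, the sketch below reads the benzene example with Python's standard `xml.etree.ElementTree` module. The helper `child_text` is a hypothetical name, and splitting the `<basic>` string at its first comma into a formula part and a connectivity part is an assumption made for the demo, not a statement of the 0.93Beta specification.

```python
import xml.etree.ElementTree as ET

# The benzene record from the slide, reproduced as a string
BENZENE_XML = """<structure number="1" id.name="" id.value="">
  <identifier version="0.93Beta" tautomeric="0">
    <basic>C6H6,1H-2H-4H-6H-5H-3H-1</basic>
    <charge></charge>
  </identifier>
  <identifier.auxiliary-info version="0.93Beta" tautomeric="0">
    <!-- Auxiliary info is not a part of the identifier, it is not unique -->
    <atom.orig-nbr>1,2,6,3,5,4</atom.orig-nbr>
    <atom.equivalence>(1,2,3,4,5,6)</atom.equivalence>
  </identifier.auxiliary-info>
</structure>"""


def child_text(parent, tag):
    """Return the text of the first direct child with this tag, or None."""
    for child in parent:
        if child.tag == tag:
            return child.text or ""
    return None


root = ET.fromstring(BENZENE_XML)
identifier = next(child for child in root if child.tag == "identifier")
auxiliary = next(child for child in root if child.tag == "identifier.auxiliary-info")

basic = child_text(identifier, "basic")
# Assumption for this demo: text before the first comma is the formula
formula, _, connectivity = basic.partition(",")
original_numbers = [int(n) for n in child_text(auxiliary, "atom.orig-nbr").split(",")]

print(identifier.get("version"), formula, connectivity)
print(original_numbers)
```

Only the `<identifier>` element carries the unique part; as the comment in the record says, the auxiliary block (such as the original atom numbering) is not part of the identifier.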
|
|
55
|
- Other Stereo Forms
- Non-atom centered
- Conformations
- Hydrogen Bonding
- Polymers/Macromolecules
- Salts, Alloys
- Organometallics
- Mixtures
- Compound Classes
|
|
56
|
|
|
57
|
|
|
58
|
|
|
59
|
- 1970’s
- Almost all in print
- Outright, permanent purchase of work with one-time money
- 2000’s
- All digital
- Cannot buy anything, must lease year-to-year with yearly money
|
|
60
|
- 1970
- Very few databases in electronic form
- Print-Libraries-Handbooks are the norm
- 2003
- No one would think of anything but an electronic database
|
|
61
|
- 1970
- First NIH database: 8,124 spectra, including duplicates
- Very limited Quality Control
- 2003
- Current 2002 NIST database:
- 174,948 total spectra
- 147,198 unique spectra
- Extensive Quality Control
|
|
62
|
- 1970’s
- Too few CAS #’s and linking
- Lack of coding/classification standards (different company names,
therapeutic use of drugs, etc.)
- 2000’s
- Too few CAS #’s and linking
- Lack of coding/classification standards (different company names,
therapeutic use of drugs, etc.)
|
|
63
|
- 1970’s
- Quality Control is time consuming, but can be affordable
- Databases don’t have CAS#’s
- 2000’s
- Quality Control is time consuming, and seems less and less affordable
- 10% of a database has superseded CAS #’s
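
One inexpensive quality-control step for CAS numbers is checking the built-in check digit. The sketch below is a minimal example, and the function name `is_valid_cas_number` is hypothetical. It relies on the published CAS rule: the last digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. It catches mistyped numbers only; detecting a superseded number still requires a lookup against the registry itself.

```python
import re


def is_valid_cas_number(cas: str) -> bool:
    """Check the format and check digit of a CAS Registry Number string."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)  # all digits except the check digit
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


for cas in ("71-43-2", "7732-18-5", "71-43-3"):
    print(cas, is_valid_cas_number(cas))
```

Here 71-43-2 (benzene) and 7732-18-5 (water) pass, while 71-43-3 fails because its last digit does not match the weighted sum.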
|
|
64
|
- 1970’s
- NCI drug database – 300,000 compounds
- No quality control on structures – just assume all are correct
- 2000’s
- NCI drug database – 500,000 compounds
- A promising compound leads to discovery of MANY, MANY bad structures
|
|
65
|
- On the one hand, why does it matter if a structure is incorrect, as long as
it is an effective drug?
- On the other hand, if one wants to find a similar, more effective, less
toxic drug, it would be nice to know what the correct structure is.
|
|
66
|
- 1970
- NIH/EPA CIS
- Structure Searching linked to data and information
- 200,000 structures
- ~ 1 million facts
- Searches only CIS databases
- Links are all “hard-wired” – same CPU
- 2003
- MDL DiscoveryGate
- Structure searching linked to data, literature, and information
- 11 million structures
- 200 million facts
- Searches only MDL databases
- Some links are now web based hyperlinks
|
|
67
|
|
|
68
|
- CIS – Not possible to search in-house and public databases; No
cooperation with ACS/CAS. All databases on same computer.
- DiscoveryGate - Not possible to search in-house and public databases; No
cooperation with ACS/CAS. Databases (e.g., journal articles) hard-linked
and hyper-linked.
|
|
69
|
|
|
70
|
- Microsoft - $28.3 billion (company started in 1975)
- Elsevier Science - $1.2 billion (excludes textbooks)
- Kluwer – 700 million (entire health division)
- ISI/Derwent - $250 million (started in 1958/1951)
- CAS - $180 million (started in 1907)
- MDL - $100 million (started in 1977)
- Accelrys - $100 million (too many parts to date)
- ACS Publications - $85 million (started in 1879)
- Tripos - $50 million (started in 1979)
- Daylight - $15 million (started in 1987)
- CambridgeSoft - $10 million (started in 1983)
- ACD/Labs – $10 million (started in 1994)
|
|
71
|
|
|
72
|
|
|
73
|
|
|
74
|
- Chemtelligence Partners - consulting services in the strategic
deployment and use of information-based tools in support of life sciences
research. (www.chemtelligence.com)
- Kilmorie - helps organizations to make the best use of their information
resources, develop new information services or products, and bring them to
market. (www.kilmorie.com)
|
|
75
|
|
|
76
|
- Over the past 20 years the number of NMEs (New Molecular Entities, or
new drugs) has stayed level or decreased, while the cost of developing a
new drug has doubled and tripled.
All this occurred in spite of the advances in chemistry and
chemical information. This is UNACCEPTABLE!
|
|
77
|
|
|
78
|
|
|
79
|
- Electronic Notebooks
- Laboratory Information Management Systems (LIMS)
- ADMET (accurate) Predictions
- “Integration”
- “Compatibility”
- “Interoperability”
- “Standardization of Information”
- “Knowledge Management”
- Knowledge not Information
- More Linux, less Microsoft
|
|
80
|
- Allows for potential easy access to information and data – both text and
graphics - for processing and/or analysis
- Saves time/money
- Will be needed because regulatory organizations, such as the FDA (21
CFR 11) and the US Patent Office, say they want it
|
|
81
|
|
|
82
|
- The need to know the pharmacokinetic properties (Absorption,
Distribution, Metabolism, Excretion, and Toxicity) of a possible drug as
early as possible, so that expensive clinical studies are not done if
there is a problem, is a major concern to the pharma industry: the need to “fail early”.
- Despite rigorous in vitro and in vivo toxicity studies at the
pre-clinical stage only one out of 10 NCEs (new chemical entities)
survives three phases of clinical trials.
|
|
83
|
- Will be needed because the FDA says so
- ADMET-1 – San Diego, February 11-13, 2004; see www.scherago.com/admet
|
|
84
|
|
|
85
|
- Over the past 30 years the quantity of chemical information available
has increased dramatically.
Quality is still an issue.
There is still not enough data. Content is still king.
- Interoperability, standardization, and integration are desired by users,
but disliked by producers.
- Print information is a one-time purchase, while electronic information
is generally available as lease-only.
|
|
86
|
- Searching has gone from slow-speed (300 baud) searching by information
professionals to high-speed (Internet T3) searching by end users.
- Databases have gone from primarily text to value-added indexing, coding,
structures, and linking.
|
|
87
|
- Two major areas of chemical information software and database
development will be ADMET and e-notebook/LIMS/e-data integration.
|
|
88
|
|