|
1
|
|
|
2
|
|
|
3
|
|
|
4
|
|
|
5
|
- 2. Description/Contents of the Databases
- 3. Features of the Databases
- 4. InChI/InChIKey
|
|
6
|
|
|
7
|
- This lecture will cover a
number of large (arbitrarily defined as greater than 1 million
structures) chemical structure databases currently available, or soon to
be available, on the Internet. Just a few years ago there were only two
very large, but rather different databases of organic chemicals
available - Beilstein and Chemical Abstracts. There was also one large
database of chemical structures associated with chemical reactions
(SPRESI). Within the past 1-2
years the situation has dramatically changed with some 20 large
structure databases of all sorts becoming available, some commercial and
some no-fee/open data.
|
|
8
|
|
|
9
|
|
|
10
|
- Ambinter – 5.5 million
- BioRad - 1 million
- ChemDB - UC Irvine - 5 million
- Chemical Abstracts - 31 million
- Chemisches Zentralblatt – 1.5 million
- ChemNavigator - 25 million
- ChemSpider - 17 million
- Crossfire Beilstein - 10 million
- Derwent Chemistry Resource – 1 million
- DiscoveryGate - 20 million
- eMolecules - 7 million
|
|
11
|
- Generated Database (GDB) Berne – 26 million
- GVK BIO – 1.4 million
- IBM Patent Database - 4 million
- Index Chemicus - 2.8 million
- NCI - 30 million
- QueryChem – Harvard - 10 million
- PubChem - 11 million
- Ryan Scientific – 1.9 million
- SPRESI - 6 million
- SureChem Patents – 10 million
- Thomson Pharma – 2.4 million
- ZINC – UCSF - 5 million
|
|
12
|
|
|
13
|
- Supplier of advanced chemicals – worldwide
- Paris, France
- http://www.ambinter.com/
- From their website:
- You can receive our last CDROM
(contact us to request a CD – a subset of the main database). The main
database has to be searched by structure/substructure/similarity online
- use the Search from the home page.
|
|
14
|
|
|
15
|
- KnowItAll® U(niversity) is a unique spectroscopy resource for research
and teaching. KnowItAll U puts the largest single collection of spectra
(over 1.3 million IR, NMR, MS, Raman, UV-Vis, and Near IR) at the
fingertips of every student, faculty, and staff member in your
institution—at any computer, campus-wide. In addition, KnowItAll U
offers award-winning chemistry, spectroscopy, and chemometrics software.
|
|
16
|
- Links:
- Chemical Names
- SMILES
|
|
17
|
- ChemDB is a chemical database
containing some 5 million commercially available small molecules,
important for use as synthetic building blocks, probes in systems
biology and as leads for the discovery of drugs and other useful
compounds. The data is publicly available over the web for download and
for targeted searches using a variety of powerful methods. The chemical
data includes predicted or experimentally determined physicochemical
properties, such as 3D structure, melting temperature and solubility. A
text-based search engine allows efficient searching of compounds based
on over 65 million annotations from over 150 vendors. Built-in reaction models enable
searches through virtual chemical space, consisting of hypothetical
products readily synthesizable from the building blocks in ChemDB.
- Availability: ChemDB and Supplementary Materials are available at http://cdb.ics.uci.edu
- Contact: pfbaldi@ics.uci.edu
|
|
18
|
- URL: http://cdb.ics.uci.edu/CHEM/Web/
- Sources of chemicals for the database:
- http://cdb.ics.uci.edu/CHEM/Web/cgibin/supplement/Implementation.py#source
- June 2007 manuscript:
- http://bioinformatics.oxfordjournals.org/cgi/reprint/23/17/2348?ijkey=swjzipsmJeGWWzS&keytype=ref
|
|
19
|
- Links:
- CAS RN (incomplete)
- Chemical Names
- SMILES
- InChI
|
|
20
|
- URL: http://www.cas.org
- For the past 100 years CAS has
indexed and summarized chemistry-related articles from about 9,500 journals (a number which has
been decreasing over the years),
as well as patents, conference proceedings, books, and other documents pertinent to
chemistry, life sciences and related areas. Since 1907 the database has accumulated over
30 million abstracts and 32 million chemical structures.
|
|
21
|
- Links:
- CAS RN’s
- Chemical Names (the most numerous of all databases – both index, common
and trivial names)
|
|
22
|
- URL: None yet
- Chemisches Zentralblatt began
its life as Pharmaceutisches Centralblatt in 1830. Between 1830 and 1897
it underwent a number of changes in its title and publisher until, in 1897, it was
renamed for the final time as Chemisches Zentralblatt. As a result of WWII it stopped
publishing in 1945 but resumed a few years later. However, it never recovered its pre-war
status and was finally terminated in 1969.
- It is currently being digitized,
and structures are being extracted from the names in the ~1.8 million
abstracts.
|
|
23
|
|
|
24
|
- URL: www.chemnavigator.com
- The iResearch Library is
ChemNavigator's up-to-date compilation of
commercially accessible screening compounds from international chemistry
suppliers. The database currently tracks over 40 million chemical
samples from some 270 suppliers. The database contains some 21 million
unique structures.
|
|
25
|
- Links:
- CAS RN’s (incomplete)
- Chemical Names (incomplete)
- InChI
- SMILES
|
|
26
|
- URL: http://www.chemspider.com/
- ChemSpider is a chemistry
search engine. It has been built with the intention of aggregating and
indexing chemical structures and their associated information into a
single searchable repository and making it available to everybody, at no
charge. ChemSpider is a value-added offering since many properties have
been added to each of the chemical structures within the database –
structure identifiers such as SMILES, InChI, IUPAC and Index Names as
well as many physicochemical properties. We intend ChemSpider to offer
the fastest chemical structure searches available online and delivered
with the flexibility and usability necessary to encourage repeat usage.
|
|
27
|
- Links:
- Chemical Names
- InChI/InChIKey
- SMILES
|
|
28
|
- What problems will ChemSpider
solve?
- There are tens if not
hundreds of chemical structure databases and no single way to search
across them. There are databases of curated literature data, chemical
vendor catalogs, molecular properties, environmental data, toxicity
data, analytical data and on and on.
- The only way to know whether
a specific piece of information is available for a chemical structure is
to have simultaneous access to all of these databases. Since many of
these databases are for profit there is no way to easily determine the
availability of information within these commercial or even in the open
access databases. With ChemSpider the intention is to aggregate into a
single database all chemical structures available within open access and
commercial databases and to provide the necessary pointers from the
ChemSpider search engine to the information of interest. This service
will allow users to either access the data immediately via open access
links or have the information necessary to continue their searches into
commercially available systems. The question “is there specific
information about my chemical” will be answered. Accessing the
information may require a commercial transaction with the appropriate
provider.
|
|
29
|
|
|
30
|
|
|
31
|
|
|
32
|
- The current Crossfire Beilstein
database consists of somewhat over 10 million structures and 320 million
experimental pieces of data. The
Beilstein database provides chemical data on organic substances and
reactions, including structures, properties, bioactivity records,
preparation details and specific reaction pathways; it also provides
citations and some abstracts to the primary organic chemistry
literature. It incorporates all of
the data from the original Beilstein Handbuch (1771-1984) and from
journals abstracted since 1980.
|
|
33
|
- Links:
- Beilstein numbers
- CAS RN’s (incomplete)
- Lawson numbers
- Chemical Names
- InChI/InChIKey
|
|
34
|
- Derwent Chemistry Resource
(Dialog File 355) lets you find specific chemical compounds within
Derwent World Patents Index (WPI) records. Unique numbers identify
specific chemical compounds and form the link between Derwent Chemistry
Resource and the corresponding bibliographic indexing in Derwent WPI.
|
|
35
|
|
|
36
|
- URL: www.discoverygate.com
- DiscoveryGate® from Symyx
Technologies is a collection of
databases (including the Crossfire Beilstein database) designed to provide scientific information
and answers to pharma/drug discovery questions. A web-based
discovery environment, DiscoveryGate integrates, indexes, and links
scientific information to give the user immediate access to compounds
and related data, reactions, original journal articles and patents, and
authoritative reference works on synthetic methodologies - all from a
single entry point.
|
|
37
|
- Links:
- CAS RN (incomplete)
- Chemical Names
- InChI/InChIKey
|
|
38
|
- URL: www.emolecules.com/
- eMolecules® describes itself
as the leading open-access chemistry search engine. eMolecules' mission
is to discover, curate and index all of the public chemical information
in the world, and make it available to the public for free. eMolecules
consists primarily of chemical catalogs and other public databases,
such as PubChem. They have recently added spectral data from Wiley to
their system.
|
|
39
|
- Links:
- CAS RN’s (incomplete)
- Chemical Names
- SMILES
|
|
40
|
- URL: http://dcbwww.unibe.ch/groups/reymond/
- GDB is a large (26 million
structures) database of generated structures, which the Reymond group
believes is of value for drug discovery.
The Reymond group built it by constructing a database of all
molecules up to 11 atoms under constraints that define chemical
stability and synthetic feasibility. The database contains 26.4 million
compounds, the vast majority of which have never been synthesized.
|
|
41
|
|
|
42
|
- Links:
- Chemical Names
- SMILES
|
|
43
|
- URL: http://www.gvkbio.com/informatics.html
- These databases are developed based on journal and patent information.
The information contains both chemical as well as biological space
pertaining to the reported molecules.
- These databases contain information on pharmacokinetics, toxicity and
clinical-relationship from various journal articles, patents, reviews,
clinical trials and all other possible sources, both public and private
in nature, updated periodically.
|
|
44
|
- Steve Boyer of IBM has taken a
copy of the computer readable version of the US Patent databases and
extracted over 7 million chemical names which he converted into a
searchable structure file.
Concept terms have also been tagged and links to the NLM PubMed
database have been made.
|
|
45
|
- Links:
- Chemical Names
- InChI/InChIKey
- SMILES
|
|
46
|
- Index Chemicus is a
text- and substructure-searchable database of the structures from
the Thomson Web of Science (WOS)
reports. It adds over 200,000 new compounds each year, with a
total coverage of over 2.8
million unique structures published in the literature since the
early 1990s. Covering the world's leading organic chemistry journals,
Index Chemicus offers full graphical summaries, important reaction
diagrams, complete bibliographic information, and author abstracts.
|
|
47
|
|
|
48
|
- The Chemical Structure Lookup Service (CSLS) is a new
web‑based system for locating chemical structures in over 70
different public and commercial data sources. The CSLS system stores
information on over 30 million chemical structures and provides a simple
search interface for looking up chemicals by specific structure as well
as by parent structure, and by various identifiers.
- The goal in creating CSLS was
to provide one publicly accessible system that cross‑references
multiple cheminformatics data sources based on chemical structure.
Scientists can use this system to find what information is available
about a specific chemical structure or a list of structures by quickly
identifying databases in which these structures occur. The links are,
in general, not direct links, as most of the databases are fee-based and
not directly available.
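The cross-referencing idea behind a lookup service of this kind can be sketched as an inverted index from a structure identifier to the data sources that contain it. The sketch below is only an illustration: the function names, the source labels and the toy records are hypothetical, and nothing here reflects how CSLS is actually implemented.

```python
from collections import defaultdict


def build_index(records):
    """Map each structure identifier to the set of sources listing it.

    `records` is an iterable of (source_name, identifier) pairs.
    """
    index = defaultdict(set)
    for source, identifier in records:
        index[identifier].add(source)
    return index


def lookup(index, identifiers):
    """Return, for each queried identifier, the sorted sources containing it."""
    return {ident: sorted(index.get(ident, ())) for ident in identifiers}


# Toy records (hypothetical, for illustration only).
records = [
    ("source_A", "KEY-0001"),
    ("source_B", "KEY-0001"),
    ("source_B", "KEY-0002"),
    ("source_C", "KEY-0003"),
]

index = build_index(records)
hits = lookup(index, ["KEY-0001", "KEY-0004"])
print(hits)
```

A query for a list of structures then reduces to one dictionary lookup per identifier; an identifier that no source contains simply returns an empty list.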
|
|
49
|
- Links:
- CAS RN’s (incomplete)
- Chemical Names
- InChI/InChIKey
- SMILES
|
|
50
|
- URL: pubchem.ncbi.nlm.nih.gov/
- PubChem is a DEPOSITION system
that provides information on the biological activities of small
molecules. It is a component of NIH's Molecular Libraries Roadmap
Initiative. The easiest way to
learn more about how to use the PubChem resources is to go to their
expanding help page:
- http://pubchem.ncbi.nlm.nih.gov/help.html
|
|
51
|
- Links:
- Chemical Names
- SMILES
|
|
52
|
- Query Chem (www.QueryChem.com)
is a Web program that integrates chemical structure and text-based
searching using publicly available chemical databases and Google's Web
Application Program Interface (API). QueryChem is simply a combination of
the databases from ChemBank, PubChem, and eMolecules. Query Chem makes
it possible to search the Web for information about chemical structures
without knowing their common names or identifiers. Furthermore, a
structure can be combined with textual query terms to further restrict
searches. Query Chem's search results can retrieve many interesting
structure-property relationships of biomolecules on the Web.
|
|
53
|
|
|
54
|
- Ryan Scientific specializes in the sales and marketing of chemicals
required by Biotechnology, Pharmaceutical, Agricultural research
companies and Universities throughout North America. Our products are
primarily focused in Drug and Ag discovery research and are used
in both High Throughput Screening (HTS) and organic synthesis, using
combinatorial and structure-based techniques.
- Their combined catalog of chemicals comes from over 100 suppliers:
- http://www.ryansci.com/adVend.htm
|
|
55
|
- SPRESI is a chemical structure
and reaction database that includes over 5 million structures, 3.7
million reactions and 28 million factual data entries extracted from
600,000 references and 164,000 patents. It was introduced in 2002 by
Infochem. It includes Synthesis Tree Search, which searches for published
synthesis reactions leading to and from the target. The SPRESIweb data have been
abstracted from over 1350 literature sources, mostly journals.
|
|
56
|
|
|
57
|
- Search more than 10 million
chemical structures
- Complete full text collections
of US, European and WO/PCT patents
- Structures updated within days
of new patent issuance
- Advanced chemical structure and
patent search tools
- Export structure and text search
results
- Powerful result filtering and
query navigation tools
|
|
58
|
- Links:
- Chemical Names
- InChI/InChIKey
- SMILES
|
|
59
|
- URL: http://www.thomson-pharma.com/
- The 2.4 million unique
structures in Thomson Pharma are drawn from the other Thomson
databases, some of which have already been mentioned:
- Derwent Drug File – 123,000
- Derwent World Patent Index - 1.05 million
- ISI Index Chemicus - 2.8 million
- Current Chemical Reactions – 561,000
- BUT it is not the sum of all the others
due to the following:
- 1. The structures are de-duplicated across the sources.
- 2. Only "pharmaceutically relevant" compounds are included,
e.g. from DWPI only those from section B (about 2/3), and from IC
only those from a subset of journals or with biological activity
(just under half the total).
- In addition, a few extra compounds,
e.g. from IDDB, CFT, are also included.
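A quick back-of-the-envelope check shows why the total is not the simple sum. The sketch below uses the counts quoted on this slide. Treating "about 2/3" as exactly 2/3 and "just under half" as 0.5 (an upper estimate), and taking the other two sources whole, are assumptions made only for illustration.

```python
# Structure counts per source, as quoted on the slide.
source_counts = {
    "Derwent Drug File": 123_000,
    "Derwent World Patent Index": 1_050_000,
    "ISI Index Chemicus": 2_800_000,
    "Current Chemical Reactions": 561_000,
}

# Assumed fraction of each source that passes the "pharmaceutically
# relevant" filter; sources not listed here are assumed to be kept whole.
retained_fraction = {
    "Derwent World Patent Index": 2 / 3,  # "about 2/3"
    "ISI Index Chemicus": 0.5,            # "just under half" (upper estimate)
}

raw_total = sum(source_counts.values())
filtered_total = round(
    sum(count * retained_fraction.get(name, 1.0)
        for name, count in source_counts.items())
)

print(f"Simple sum of sources:      {raw_total:,}")
print(f"After relevance filtering:  {filtered_total:,}")
```

Under these assumptions the simple sum is about 4.5 million and the filtered estimate about 2.8 million, so the remaining gap down to the quoted 2.4 million would be accounted for by de-duplication across the sources.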
|
|
60
|
|
|
61
|
- URL: blaster.docking.org/zinc/
- ZINC is a free database of
commercially‑available compounds for virtual screening. ZINC
contains over 4.6 million compounds in ready‑to‑dock, 3D
formats. ZINC is provided by the Shoichet Laboratory in the Department
of Pharmaceutical Chemistry at the University of California, San
Francisco (UCSF). There was a descriptive write up on ZINC in C&E
news in 2005:
- http://pubs.acs.org/cen/news/83/i07/8307notw3.html
- Funded by NIH, and with the
agreement of numerous chemical supplier companies, ZINC can be used with numerous docking
programs. Thus ZINC is effectively a "ready to dock" database. Shoichet and Irwin have produced three-dimensional
structures from two-dimensional information, weeded out insoluble
forms, and calculated properties
such as protonation states and number of rotatable bonds.
|
|
62
|
- Links:
- Chemical Names
- SMILES
|
|
63
|
- Category 1: (Literature and/or Patent Links)
- Beilstein
- Chemical Abstracts
- Chemisches Zentralblatt
- Derwent Chemistry Resource
- DiscoveryGate
- GVK BIO
- IBM Patent Database
- Index Chemicus
- PubChem
- SureChem
- SPRESI
|
|
64
|
- Category 2: (Chemical Catalogs/Information)
- Ambinter
- ChemDB
- ChemSpider
- ChemNavigator
- eMolecules
- GDB - Berne
- NCI
- QueryChem
- Ryan Scientific
|
|
65
|
- Category 3: (Data containing -
chemical reactions alone not being considered data)
- Beilstein
- DiscoveryGate
- eMolecules
- PubChem
|
|
66
|
- Free:
- Ambinter
- ChemDB
- ChemNavigator
- ChemSpider
- eMolecules
- GDB – Berne
- IBM Patents
- NCI
- PubChem
- QueryChem
- Ryan Scientific
- SureChem
- ZINC
- Fee:
- CAS
- Chemisches Zentralblatt
- CrossFire Beilstein
- Derwent
- DiscoveryGate
- GVK BIO
- Index Chemicus
- SPRESI
- Download:
- Ambinter (partial)
- ChemDB
- GDB – Berne
- PubChem
- Ryan Scientific
|
|
67
|
- CAS RN SMILES InChI Names
- (all partial except CAS)
- Bio-Rad
Bio-Rad
- ChemDB
ChemDB
ChemDB
- CAS
CAS
-
Chemisches Zentralblatt
- ChemNavigator
ChemNavigator
ChemNavigator
- ChemSpider
ChemSpider
ChemSpider
ChemSpider
- CrossFire Beilstein CrossFire
Beilstein CrossFire Beilstein
-
Derwent
- DiscoveryGate
DiscoveryGate
DiscoveryGate
DiscoveryGate
-
eMolecules
eMolecules
-
GDB
-
GVK BIO
-
IBM Patents IBM
Patents IBM Patents
-
Index Chemicus
- NCI
NCI
NCI
NCI
-
QueryChem
QueryChem
- PubChem
PubChem
PubChem
PubChem
-
Ryan Scientific
-
SureChem
SureChem
SureChem
-
SPRESI
-
Thomson Pharma
- ZINC ZINC ZINC ZINC
|
|
68
|
- Patents Data Reactions Predictions
- CAS
CAS
CAS (some)
- Chemisches Zentralblatt
-
ChemSpider
- CrossFire Beilstein CrossFire
Beilstein CrossFire
Beilstein CrossFire
Beilstein
- Derwent
- DiscoveryGate
DiscoveryGate
DiscoveryGate
DiscoveryGate
- GVK BIO
- IBM Patents
- Index Chemicus
Index Chemicus
-
NCI
NCI (some)
- SureChem
-
SPRESI
-
ZINC
|
|
69
|
|
|
70
|
- 1. Easy to generate (It will use existing software.)
- 2. Expressive (It will contain structural information.)
- 3. Unique/Unambiguous
- 4. Easy to search for structure via Internet search engines (Google,
Yahoo, Microsoft Live, etc.) using the InChI (hash) Key.
|
|
71
|
|
|
72
|
|
|
73
|
|
|
74
|
|
|
75
|
D-Fructose (natural)
InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m1/s1
InChIKey=BJHIKXHVCXFQLS-UYFOZJQFBH
L-Fructose
InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m0/s1
InChIKey=BJHIKXHVCXFQLS-FUTKDDECBR
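The two fructose enantiomers above illustrate how the identifiers are layered: the InChI strings differ only in the /m stereo layer, and the InChIKeys share their first block while differing in the second. A minimal sketch that checks this directly on the strings from the slide (the helper names are illustrative and not part of any InChI software):

```python
# Identifier strings copied from the slide.
d_inchi = "InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m1/s1"
l_inchi = "InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m0/s1"
d_key = "BJHIKXHVCXFQLS-UYFOZJQFBH"
l_key = "BJHIKXHVCXFQLS-FUTKDDECBR"


def split_key(key):
    """Split an InChIKey into its first and second hash blocks."""
    first, second = key.split("-", 1)
    return first, second


def differing_layers(inchi_a, inchi_b):
    """Return the pairs of '/'-separated layers that differ between two InChIs."""
    layers_a = inchi_a.split("/")
    layers_b = inchi_b.split("/")
    return [(a, b) for a, b in zip(layers_a, layers_b) if a != b]


d_first, d_second = split_key(d_key)
l_first, l_second = split_key(l_key)

print(d_first == l_first)    # True: same first block
print(d_second == l_second)  # False: second block differs
print(differing_layers(d_inchi, l_inchi))  # [('m1', 'm0')]
```

Comparing only the first block is therefore a simple way to find all stereo variants of a given skeleton, while the full key distinguishes D- from L-fructose.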
|
|
76
|
|
|
77
|
|
|
78
|
- Like any hash, the InChIKey may not be unique for HUGE datasets
- Estimated resistance (corresponds to ½ probability of a SINGLE
collision):
- 1st block: 6.1 × 10^9 molecular skeletons
- 2nd block: 3.7 × 10^5 stereo/tauto/isotopomers per skeleton
- Number of molecules in current databases: ~(3-4) × 10^7
- Testing:
- internal: up to 7.7 × 10^7 molecules
- independent: by ChemSpider (http://www.chemspider.com), 1.7 × 10^7 real molecules
- No collisions found.
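The resistance figures above are what a square-root (birthday) estimate gives if the first block is assumed to carry 65 bits of hash and the second block 37 bits. Those bit widths are an assumption here, not stated on the slide; they are chosen because they reproduce the quoted numbers. A small sketch under that assumption:

```python
import math

FIRST_BLOCK_BITS = 65   # assumed width of the first-block hash
SECOND_BLOCK_BITS = 37  # assumed width of the second-block hash


def birthday_bound(bits):
    """Square-root estimate of how many random hashes make a collision likely."""
    return math.sqrt(2.0 ** bits)


def collision_probability(n_items, bits):
    """Approximate chance of at least one collision among n_items random hashes."""
    space = 2.0 ** bits
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


print(f"1st block bound: {birthday_bound(FIRST_BLOCK_BITS):.2e}")
print(f"2nd block bound: {birthday_bound(SECOND_BLOCK_BITS):.2e}")
print(f"P(collision) for 4e7 skeletons: "
      f"{collision_probability(4e7, FIRST_BLOCK_BITS):.1e}")
```

Under the same model, a collection of about 4 × 10^7 distinct skeletons (the database sizes quoted above) has a first-block collision probability on the order of 10^-5, which is consistent with no collisions having been found in testing.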
|
|
79
|
- Publishers:
- Royal Society of Chemistry www.rsc.org/Publishing/Journals/ProjectProspect/
- Prous Science - Drugs of the Future
- BioMed Central - Chemistry Central www.chemistrycentral.com
- Other:
- 1. European Patent Office (EPO)
|
|
80
|
|
|
81
|
- 1. InChI is the only
publicly available method for creating a unique chemical identifier for
a given chemical structure. In
addition, InChI has a number of other valuable attributes, noted below.
- 2. InChI is free, open-source software. (Web 2.0)
- 3. Any organization (public or private) can use it for internal
and/or external structure files at no cost. (Web 2.0)
- The Web 2.0 is the second
generation of web-based communities and hosted services — such as
social-networking sites — which facilitate collaboration and sharing
between users. Web 1.0 is where
information comes from one central source.
|
|
82
|
- 4. It is sponsored by IUPAC
and primarily implemented by the US scientific standards agency,
NIST.
- 5. It allows the chemistry community to use the InChIKey as a universal chemical identifier.
This means InChIs can be freely
searched for via Google/Yahoo/Microsoft Live and other Internet search
engines. (Web 2.0)
- 6. The InChIKey unlocks the data and information from all sites
around the world that choose to use it.
The InChIKey allows all those commercial chemical information
providers (e.g., Elsevier, Thomson,
Prous Science, and John
Wiley) to have a free
structure and number/linking system. (Web 2.0)
|
|
83
|
- Philip Abrahams, Steve Bachrach, Steve Boyer, Colin Batchelor, Ted
Becker, Jost Bohlen, Pieter Bolman, Evan Bolton, Bob Bovenschulte, Steve
Bryant, Harry Collier, Alice Cooper, Nick Day, Rene Deplanque, Ron
Dunn, Simon Quellen Field, Guenter Grethe, Stevan Harnad, Wolf-Dietrich
Ihlenfeldt, Sami Kassab, Richard Kidd, Sandy Lawson, David Lipman, Gary
Mallard, Randy Marcinko, Bill Milne, Carmen Nitsche, Josep Prous, Chris
Reed, Rich Roberts, Peter Murray-Rust, Henry Rzepa, Peter Shepherd, Bill Town, Andrea
Twiss-Brooks, Don Walters, Wendy Warr, Tony Williams, and Ann Wolpert
|