|
1
|
|
|
2
|
|
|
3
|
|
|
4
|
- Stephen R. Heller
- Alan McNaught
- Igor Pletnev
- Stephen E. Stein
- Dmitrii Tchekhovskoi
|
|
5
|
- 1. Background/History/Objective
- 2. Development of InChI
- 3. InChIKey
- 4. InChI Adoption & Use
- 5. Conclusion
|
|
6
|
|
|
7
|
|
|
8
|
|
|
9
|
- A project whose time has come.
- The Internet, an international scientific body (IUPAC), and
international cooperation (US, UK, Czech Republic) have led to the rapid
development, implementation, and use of InChI.
- Furthermore, cooperation from software vendors, particularly those with
structure drawing software, has made generating InChIs very easy for
all chemists.
|
|
10
|
- While InChI is an Open Source, public domain system for creating a
unique computer-readable identifier (“name”), it is NOT a registry
system. InChIs are created only by those who choose to adopt and use the
algorithm. Registry systems that index the literature are complementary
to any InChI databases that anyone creates.
|
|
11
|
- Using an InChI/InChIKey means knowing you will find a match if it is
there, without needing to worry whether it was coded differently by
another person or program. InChI/InChIKey means you are no longer
dependent on any proprietary system, and you are much more likely to
link to, and be linked from, many more chemists and sources of chemical
information than has been possible in the past. The InChI/InChIKey is a
system for both public and private (fee-based) sources.
|
|
12
|
- Using InChI means you can freely exchange structure files with others
within your organization and with any person or organization anywhere in
the world, knowing that the structure name (the InChI/InChIKey) will be
the same. You can search for the InChI/InChIKey on the Internet using
Google, Yahoo, Microsoft, etc.
|
|
13
|
|
|
14
|
- 1. Easy to generate (It will use existing software.)
- 2. Expressive (It will contain structural information.)
- 3. Unique/Unambiguous
- 4. Easy to search for structure via Internet search engines (Google,
Yahoo, Microsoft Live, etc.) using the InChI (hash) Key.
|
|
15
|
|
|
16
|
- Publishers need to combat Open Access activities with added value.
- InChI will do that for chemistry.
|
|
17
|
|
|
18
|
|
|
19
|
|
|
20
|
|
|
21
|
|
|
22
|
|
|
23
|
- The input structure and its normalized structure are shown below; dots
correspond to pi-electrons and are shown for illustrative purposes only.
|
|
24
|
|
|
25
|
|
|
26
|
- The InChI string has been found to be too long for Internet search
engines to use, hence the need for a fixed-length InChIKey. The InChIKey
is a 25-character hash code of the InChI string (14 + 8 = 22 hash
characters, + 1 flag + 1 check + 1 dash). It is made up of four (4) parts:
- AAAAAAAAAAAAAA-BBBBBBBBCD
- A: 14 characters for the basic structure
- B: 8 characters for the layers
- C: 1 character is a flag indicating certain features (e.g., fixed or
not fixed hydrogen atoms)
- D: 1 character is a “check” character
- A hash code is a fixed-length condensed digital representation of a
variable-length character string.
- The InChIKey is based on a truncated SHA-256 cryptographic hash function
(http://en.wikipedia.org/wiki/SHA-2).
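The "fixed length from variable length" idea can be sketched in a few lines of Python. This is an illustration only: it truncates a SHA-256 hex digest and is NOT the actual InChIKey encoding, whose exact bit selection and letter mapping are not given on these slides. The function name `illustrative_digest` and the 16-character truncation are assumptions made for the example.

```python
import hashlib

def illustrative_digest(text: str, n_hex: int = 16) -> str:
    """Return the first n_hex hex characters of SHA-256(text).

    Illustration of a truncated hash only; this is NOT the InChIKey
    algorithm, which maps selected hash bits to uppercase letters.
    """
    return hashlib.sha256(text.encode("ascii")).hexdigest()[:n_hex]

# Two InChI strings of different length (taken from later slides).
caffeine = "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
d_fructose = ("InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/"
              "h3,5-9,11-12H,1-2H2/t3-,5-,6-/m1/s1")

for inchi in (caffeine, d_fructose):
    # Input length varies; digest length does not.
    print(len(inchi), illustrative_digest(inchi))
```

Whatever the length of the input string, the digest has the same length, which is the property that makes a hash convenient for search engines and fixed-width database fields.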
|
|
27
|
- The principal new features of the InChIKey are:
- A fixed-length (25-character) condensed digital representation of the
Identifier, to be known as InChIKey. In particular, this will
- * facilitate web searching, previously complicated by unpredictable
breaking of InChI character strings by search engines
- * allow development of a web-based InChI lookup service
- * permit an InChI representation to be stored in fixed length fields
- * make chemical structure database indexing easier
- * allow verification of InChI strings after network transmission.
|
|
28
|
- Caffeine:
- InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
- InChIKey=RYYVLZVUVIJVGH-UHFFFAOYAW
- First block (14 letters), encodes molecular skeleton (connectivity):
RYYVLZVUVIJVGH
- Second block (8 letters), encodes proton positions (tautomers),
stereochemistry, isotopes, reconnected layer: UHFFFAOY
- Flag character, indicates InChI version, presence/absence of fixed H
layer, isotopes, and stereochemistry: A
- Check character: W
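The breakdown of the caffeine key can be reproduced with a small parser. This is a minimal sketch that assumes the 25-character layout described on these slides (14 letters, a dash, 8 letters, one flag letter, one check letter); the names `InChIKeyParts` and `split_inchikey` are hypothetical, and the check character is only extracted, not verified.

```python
from typing import NamedTuple

class InChIKeyParts(NamedTuple):
    skeleton: str  # first block, 14 letters
    layers: str    # second block, 8 letters
    flag: str      # 1 letter
    check: str     # 1 letter

def split_inchikey(key: str) -> InChIKeyParts:
    """Split a 25-character InChIKey into the four parts named on the slide."""
    prefix = "InChIKey="
    if key.startswith(prefix):
        key = key[len(prefix):]
    if len(key) != 25 or key[14] != "-":
        raise ValueError("expected 14 letters, a dash, then 10 letters")
    letters = key[:14] + key[15:]
    if not (letters.isascii() and letters.isalpha() and letters.isupper()):
        raise ValueError("expected uppercase ASCII letters only")
    return InChIKeyParts(key[:14], key[15:23], key[23], key[24])

parts = split_inchikey("InChIKey=RYYVLZVUVIJVGH-UHFFFAOYAW")
print(parts)
```

Because every key has the same shape, the split is purely positional and needs no knowledge of the underlying structure.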
|
|
29
|
|
|
30
|
D-Fructose (natural)
InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m1/s1
InChIKey=BJHIKXHVCXFQLS-UYFOZJQFBH
L-Fructose
InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m0/s1
InChIKey=BJHIKXHVCXFQLS-FUTKDDECBR
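The two fructose keys share their first block and differ only after the dash, which is what one would expect if the first block encodes connectivity and the second block carries stereochemistry. A minimal sketch of that comparison follows; the helper names `first_block` and `same_skeleton` are hypothetical, introduced only for this example.

```python
def first_block(key: str) -> str:
    """Return the part of an InChIKey before the dash (the skeleton block)."""
    prefix = "InChIKey="
    if key.startswith(prefix):
        key = key[len(prefix):]
    return key.split("-", 1)[0]

def same_skeleton(key_a: str, key_b: str) -> bool:
    """True if both keys have the same first (connectivity) block."""
    return first_block(key_a) == first_block(key_b)

d_fructose_key = "InChIKey=BJHIKXHVCXFQLS-UYFOZJQFBH"
l_fructose_key = "InChIKey=BJHIKXHVCXFQLS-FUTKDDECBR"

print(same_skeleton(d_fructose_key, l_fructose_key))  # same connectivity?
print(d_fructose_key == l_fructose_key)               # same full key?
```

A search on the first block alone would therefore retrieve both enantiomers, while a search on the full key distinguishes them.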
|
|
31
|
|
|
32
|
|
|
33
|
- As with any hash, the InChIKey may not be unique for HUGE datasets
- Estimated resistance (corresponds to ½ probability of a SINGLE
collision):
- 1st block: 6.1×10^9 molecular skeletons
- 2nd block: 3.7×10^5 stereo/tauto/isotopomers per skeleton
- Number of molecules in current databases: ~(3-4)×10^7
- Testing:
- internal: up to 7.7×10^7 molecules
- independent: by ChemSpider (http://www.chemspider.com),
1.7×10^7 real molecules
- No collisions found.
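The resistance figures above can be checked with the usual square-root (birthday) rule of thumb. The sketch below assumes that the first block carries 65 bits of the hash and the second block 37 bits; these bit widths are an assumption not stated on the slides, although they do reproduce the quoted 6.1×10^9 and 3.7×10^5. The collision probability uses the standard approximation 1 - exp(-n(n-1)/(2N)), under which n = sqrt(N) gives roughly 0.39 rather than exactly ½, so the slide's figures are best read as order-of-magnitude estimates.

```python
import math

SKELETON_BITS = 65  # assumed width of the first block (not from the slides)
LAYER_BITS = 37     # assumed width of the second block (not from the slides)

def sqrt_resistance(bits: int) -> float:
    """Square-root rule of thumb: about sqrt(2**bits) items before a collision."""
    return math.sqrt(2.0 ** bits)

def collision_probability(n_items: float, bits: int) -> float:
    """Approximate chance of at least one collision among n_items random hashes."""
    space = 2.0 ** bits
    return -math.expm1(-n_items * (n_items - 1.0) / (2.0 * space))

print(f"1st block: {sqrt_resistance(SKELETON_BITS):.2e} skeletons")
print(f"2nd block: {sqrt_resistance(LAYER_BITS):.2e} variants per skeleton")
# Chance of any first-block collision in a database of 4×10^7 skeletons.
print(f"{collision_probability(4e7, SKELETON_BITS):.1e}")
```

Under these assumptions a database of a few times 10^7 structures sits well below the first-block resistance, which is consistent with no collisions having been found in testing.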
|
|
34
|
|
|
35
|
|
|
36
|
|
|
37
|
- Publishers:
- 1. Royal Society of Chemistry www.rsc.org/Publishing/Journals/ProjectProspect/
- 2. Prous Science - Drugs of the Future www.prous.com/journals/dof/20002507/index.cfm
- 3. BioMed Central - Chemistry Central www.chemistrycentral.com
- Other:
- 1. European Patent Office (EPO)
|
|
38
|
|
|
39
|
- 1. InChI is the only publicly available method for creating a unique
chemical identifier for a given chemical structure. In addition, InChI
has a number of other valuable attributes, noted below.
- 2. InChI is free, open source software. (Web 2.0)
- 3. Any organization (public or private) can use it for internal
and/or external structure files at no cost. (Web 2.0)
- (Web 2.0 is the second generation of web-based communities and hosted
services, such as social-networking sites, which facilitate
collaboration and sharing between users. Web 1.0 is where information
comes from one central source.)
|
|
40
|
- 4. It is sponsored by IUPAC and primarily implemented by the US
scientific standards agency, NIST.
- 5. It allows the chemistry community to use the InChIKey as a universal
chemical identifier. This means InChIs can be freely searched for via
Google/Yahoo/Microsoft Live and other Internet search engines. (Web 2.0)
- 6. The InChIKey unlocks the data and information from all sites
around the world that choose to use it. The InChIKey allows commercial
chemical information providers (e.g., Thieme, Elsevier, Thomson, Prous
Science, and John Wiley) to have a free structure and number/linking
system. (Web 2.0)
|
|
41
|
- Will register any “chemical” -
need not be a defined/definite structure
- Charges a fee for a CAS RN or new CAS RN
- Will not let people use CAS RN in large databases without a contract
and an ongoing fee
- Essentially (99+%) covers only the chemical literature (CAS abstracts)
- CAS RNs are generated only at CAS
|
|
42
|
- Open Source
- No cost to anyone (except labor for implementation)
- Key can be generated by anyone,
anywhere
- Can be used internally and
externally (Internet - web)
- Very few InChIKeys are associated with the literature (<1%)
- Is the only available structure representation that can be used
world-wide in any database.
- InChIKey is created by database
owner, not by IUPAC or
a central service/source.
- Based on the above it is easy to understand why it is likely that the
InChIKey will be the globally accepted standard for defining and
describing a defined chemical substance.
|
|
43
|
- Philip Abrahams, Steve Bachrach, Colin Batchelor, Ted Becker, Jost
Bohlen, Pieter Bolman, Evan Bolton, Bob Bovenschulte, Steve Bryant,
Harry Collier, Alice Cooper, Nick Day, Rene Deplanque, Ron Dunn,
Simon Quellen Field, Guenter Grethe, Stevan Harnad, Wolf-Dietrich
Ihlenfeldt, Sami Kassab, Richard Kidd, Sandy Lawson, David Lipman, Gary
Mallard, Randy Marcinko, Bill Milne, Carmen Nitsche, Josep Prous, Chris
Reed, Rich Roberts, Peter Murray-Rust, Henry Rzepa, Peter Shepherd, Bill Town, Andrea
Twiss-Brooks, Wendy Warr, Tony Williams, and Ann Wolpert
|