1
The future of linking (most all) chemical information using InChI and the InChIKey.

Stephen R. Heller

steve@hellers.com
2
This will be an atypical talk.

 I have trouble being normal.
3
The slides from this presentation can be found at:

http://www.hellers.com/steve/pub-talks/
(Thieme November 2007 link)



The main web site for the IUPAC InChI project is:

http://www.iupac.org/inchi
4
The InChI Team
  • Stephen R. Heller
  • Alan McNaught
  • Igor Pletnev
  • Stephen E. Stein
  • Dmitrii Tchekhovskoi
5
Outline of Presentation
  • 1. Background/History/Objective
  • 2. Development of InChI
  • 3. InChIKey
  • 4. InChI Adoption & Use
  • 5. Conclusion
6
Thieme-IUPAC Prize in Synthetic Organic Chemistry
 
The Thieme-IUPAC Prize is awarded every two years on the occasion of IUPAC's International Conference on Organic Synthesis (ICOS) to a scientist under 40 years of age, whose research has had a major impact on the field of synthetic organic chemistry. The Prize is sponsored jointly by Georg Thieme Verlag, IUPAC, and the Editors of Synthesis, Synlett, Science of Synthesis, and Houben-Weyl.
7
Date: Mon, 15 Nov 1999 18:48:30 -0500 (EST)
From: Stephen R. Heller<srheller@cliff.nal.usda.gov>
To: stein <sstein@enh.nist.gov>
Subject: Re: A strawman proposal

Steve-

First rough draft. Let's talk tomorrow about it.

Steve

--------------
11/15/99

An IUPAC Chemical Registry System

        In response to the upcoming March 2000 IUPAC meeting -
Representations of Molecular Structure: Nomenclature and its Alternatives
- I would like to propose the creation of an IUPAC public domain chemical
registry system.
…
8
                 Objective

The objective of the IUPAC Chemical Identifier Project is to create a unique label, the IUPAC Chemical Identifier (InChI), which will be an Open Source, non-proprietary identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of diverse data and information compilations.
9
InChI
  • A project whose time has come.  The Internet, an international scientific body (IUPAC), and international cooperation (US, UK, Czech Republic) have led to the rapid development, implementation, and use of InChI.


  • Furthermore, cooperation from software vendors, particularly those with structure drawing software, has made generating InChIs very easy for all chemists.


10
InChI
  • While InChI is an Open Source, public domain system for creating a unique computer-readable identifier (“name”), it is NOT a registry system. InChIs are created only by those who choose to adopt and use the algorithm. Registry systems which index the literature are complementary to any InChI databases that anyone creates.


11
InChI
  • Using an InChI/InChIKey, you know you will find a match if it is there, and you need not worry whether it was coded differently by another person or program. InChI/InChIKey means you are no longer dependent on any proprietary system, and you are much more likely to link to, and be linked from, many more chemists and sources of chemical information than has been possible in the past. The InChI/InChIKey is a system for both public and private (fee-based) sources.
12
InChI
  •    Using InChI means you can freely exchange structure files with others within your organization and with any person or organization anywhere in the world knowing the structure name, the InChI/InChIKey, will be the same. You can search for the InChI/InChIKey on the Internet, using Google/Yahoo/Microsoft, etc.
13
 
14
InChI Characteristics
  • 1. Easy to generate (It will use existing software.)
  • 2. Expressive (It will contain structural information.)
  • 3. Unique/Unambiguous
  • 4. Easy to search for structure via Internet search engines (Google, Yahoo, Microsoft Live, etc.) using the InChI (hash) Key.
15
 
16
IUPAC Goal - Cooperation with and support of chemical industry and scientific publishers in the development, implementation, and use of InChI/InChIKey.
  • Publishers need to combat Open Access activities with added value.


  •   InChI will do that for chemistry.
17
InChI is an agent of change
18
 
19
 
20
 
21
 
22
Example of Basic Connectivity
23
Example of Basic Connectivity
  • The input structure and its normalized structure are shown below – dots correspond to pi-electrons and are shown for illustrative purposes only.
24
Example of a Tautomer : Guanine

This layer is derived from the Basic Layer by the logical removal of mobile H-atoms and the tagging of H-donor and H-receptor atoms. The input structure and its normalized structure are shown below – dots correspond to pi-electrons and are shown for illustrative purposes only.
25
 
26
InChIKey
  • The InChI string has been found to be too long for Internet search engines to use, hence the need for a fixed-length InChIKey. The InChIKey is a 25-character (14 + 8 = 22, + 1 flag + 1 check + 1 dash) hash code of the InChI string. It is made up of four (4) parts:


  •                               AAAAAAAAAAAAAA-BBBBBBBBCD


  •    A: 14 characters for the basic structure
  •    B: 8 characters for the layers
  •    C: 1 character is a flag indicating certain features (e.g., fixed or not fixed hydrogen atoms)
  •    D: 1 character is a “check” character


  • A hash code is a fixed length condensed digital representation of a variable character string.


  • The InChIKey is based on a truncated SHA-256 cryptographic hash function.
  •    (http://en.wikipedia.org/wiki/SHA-2)
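To make the idea of a fixed-length hash concrete, here is a minimal Python sketch. It uses the standard `hashlib` SHA-256 and simply truncates the hex digest. It does not reproduce the actual InChIKey encoding (the letter alphabet, the two-block split, and the flag and check characters are not implemented); the helper name `toy_key` and the 16-character length are assumptions made purely for illustration.

```python
import hashlib


def toy_key(inchi: str, length: int = 16) -> str:
    """Return a fixed-length, truncated SHA-256 hex digest of an InChI string.

    Illustration only: this is NOT the real InChIKey algorithm.
    """
    digest = hashlib.sha256(inchi.encode("ascii")).hexdigest()
    return digest[:length].upper()


caffeine = "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
fructose = ("InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8"
            "/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m1/s1")

# Input strings of different lengths produce keys of the same length.
for inchi in (caffeine, fructose):
    print(len(inchi), toy_key(inchi))
```

The same input always yields the same key, which is what lets a short key stand in for a long identifier in a search box or a fixed-width database field.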



27
InChIKey
  • The principal new features of the InChIKey  are:



  • A fixed-length (25-character) condensed digital representation of the Identifier, to be known as InChIKey. In particular, this will


  • * facilitate web searching, previously complicated by unpredictable breaking of InChI character strings by search engines


  • * allow development of a web-based InChI lookup service


  • * permit an InChI representation to be stored in fixed length fields


  • * make chemical structure database indexing easier


  • * allow verification of InChI strings after network transmission.


28
Caffeine:


  • InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3


  • InChIKey=RYYVLZVUVIJVGH-UHFFFAOYAW



  • First block (14 letters), encodes molecular skeleton (connectivity): RYYVLZVUVIJVGH


  • Second block (8 letters), encodes proton positions (tautomers), stereochemistry, isotopes, reconnected layer: UHFFFAOY


  • Flag character, indicates InChI version, presence/absence of fixed H layer, isotopes, and stereochemistry: A


  • Check character: W
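The breakdown above can be sketched as a small Python helper that splits a 25-character InChIKey into the four parts described on this slide. The names `KeyParts` and `split_inchikey` are hypothetical, and the function only checks the layout (length, dash position, upper-case letters); it does not recompute or verify the check character.

```python
from typing import NamedTuple


class KeyParts(NamedTuple):
    skeleton: str  # first block, 14 letters (connectivity)
    layers: str    # second block, 8 letters (tautomers, stereo, isotopes, ...)
    flag: str      # 1 letter
    check: str     # 1 letter


def split_inchikey(key: str) -> KeyParts:
    """Split a 25-character InChIKey laid out as on this slide."""
    prefix = "InChIKey="
    if key.startswith(prefix):
        key = key[len(prefix):]
    if len(key) != 25 or key[14] != "-":
        raise ValueError(f"not a 25-character InChIKey: {key!r}")
    first, second = key[:14], key[15:]
    letters = first + second
    if not (letters.isalpha() and letters.isupper()):
        raise ValueError(f"InChIKey must contain only upper-case letters: {key!r}")
    return KeyParts(first, second[:8], second[8], second[9])


parts = split_inchikey("InChIKey=RYYVLZVUVIJVGH-UHFFFAOYAW")
print(parts.skeleton, parts.layers, parts.flag, parts.check)
```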


29
Really long InChI (Palytoxin)
30
D-Fructose (natural)

InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m1/s1

InChIKey=BJHIKXHVCXFQLS-UYFOZJQFBH


L-Fructose

InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m0/s1

InChIKey=BJHIKXHVCXFQLS-FUTKDDECBR
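A quick way to see where the two fructose identifiers differ is to split each InChI string on "/" and compare the pieces position by position. This is a minimal sketch: the helper name `differing_layers` is hypothetical, and the position-by-position comparison assumes both strings have the same number of layers, which holds for this pair.

```python
def differing_layers(inchi_a: str, inchi_b: str) -> list[tuple[str, str]]:
    """Return the pairs of '/'-separated layers that differ between two InChIs."""
    layers_a = inchi_a.split("/")
    layers_b = inchi_b.split("/")
    return [(a, b) for a, b in zip(layers_a, layers_b) if a != b]


d_fructose = ("InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8"
              "/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m1/s1")
l_fructose = ("InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8"
              "/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m0/s1")

print(differing_layers(d_fructose, l_fructose))  # [('m1', 'm0')]
```

Only the `/m` layer changes between the two enantiomers; the formula, connectivity, and hydrogen layers are identical.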
31

Fructose
D-Fructose
InChIKey: BJHIKXHVCXFQLS-UYFOZJQFBH

Fructose
L-Fructose
InChIKey: BJHIKXHVCXFQLS-FUTKDDECBR


Note: First 14 characters of BOTH InChIKeys are the SAME!

The 1st block (14) encodes the connectivity.
The 2nd block (8) encodes proton positions (tautomers), stereochemistry, isotopes, etc.
Check Character
Flag Character – InChI version, presence/absence of stereo info/isotopes, etc.
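Because stereoisomers share the first block, a database can group related records by that block alone. The following is a minimal sketch of such grouping; the function name `group_by_skeleton` is an assumption for illustration, and the keys are the ones shown in this talk.

```python
from collections import defaultdict


def group_by_skeleton(keys: list[str]) -> dict[str, list[str]]:
    """Group InChIKeys by their first block (the part before the dash)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        skeleton = key.split("-", 1)[0]
        groups[skeleton].append(key)
    return dict(groups)


keys = [
    "BJHIKXHVCXFQLS-UYFOZJQFBH",  # D-fructose
    "BJHIKXHVCXFQLS-FUTKDDECBR",  # L-fructose
    "RYYVLZVUVIJVGH-UHFFFAOYAW",  # caffeine
]

for skeleton, members in group_by_skeleton(keys).items():
    print(skeleton, len(members))
```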
32
Stereoisomers of menthol
33
InChIKey – collision resistance
  • Like any hash, it may not be unique for HUGE datasets
  • Estimated resistance (corresponds to ½ probability of a SINGLE collision):
    • 1st block:  6.1×10^9 molecular skeletons
    • 2nd block:  3.7×10^5 stereo/tauto/isotopomers per skeleton


  • Number of molecules in current databases: ~(3-4)×10^7


  • Testing:
    • internal:  up to 7.7×10^7 molecules
    • independent: by ChemSpider (http://www.chemspider.com), 1.7×10^7 real molecules
    • No collisions found.
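The resistance figures above are consistent with a simple square-root (birthday) estimate. The sketch below assumes the first block carries about 65 bits and the second about 37 bits; these bit widths are an assumption inferred from the quoted figures, not something stated on the slide. The usual 50% birthday estimate carries an extra factor of about 1.18, so the numbers should be read as orders of magnitude.

```python
import math


def birthday_scale(bits: int) -> float:
    """Square-root estimate of how many items fit before a collision is likely."""
    return math.sqrt(2.0 ** bits)


first_block = birthday_scale(65)   # assumed width of the 14-letter block
second_block = birthday_scale(37)  # assumed width of the 8-letter block

print(f"{first_block:.1e}")   # 6.1e+09
print(f"{second_block:.1e}")  # 3.7e+05
```

With roughly (3-4)×10^7 known molecules, current databases sit about two orders of magnitude below the first-block estimate.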


34
 
35
 
36
 
37
Other InChI Adopters
  • Publishers:


  • 1. Royal Society of Chemistry www.rsc.org/Publishing/Journals/ProjectProspect/


  • 2. Prous Science - Drugs of the Future www.prous.com/journals/dof/20002507/index.cfm


  • 3. BioMed Central - Chemistry Central www.chemistrycentral.com


  • Other:


  • 1. European Patent Office (EPO)
38
 InChI URLs
Main IUPAC InChI page: http://iupac.org/inchi/

InChI Google video lecture (11/06):
http://video.google.com/videoplay?docid=-6653695245776470969&q=heller+chemical

InChI Google video lecture (10/07): http://youtube.com/watch?v=F9XppyZg4E4

B. Kosata (Prague):
www.inchi.info

P. Murray-Rust/Nick Day (Cambridge): http://wwmm.ch.cam.ac.uk/inchifaq/

ChemSpider:
http://www.chemspider.com/inchi.asmx
39
Summary - Overall Features of InChI (1)
  • 1. InChI is the only publicly available method for creating a unique chemical identifier for a given chemical structure. In addition, InChI has a number of other valuable attributes noted below.

    2. InChI is free, open source software. (Web 2.0)

    3. Any organization (public or private) can use it for internal and/or external structure files at no cost. (Web 2.0)
  • (Web 2.0 is the second generation of web-based communities and hosted services, such as social-networking sites, which facilitate collaboration and sharing between users. Web 1.0 is where information comes from one central source.)
40
Summary - Overall Features of InChI (2)
  • 4. It is sponsored by IUPAC and primarily implemented by the US scientific standards agency, NIST.

    5. It allows the chemistry community to use the InChIKey as a universal chemical identifier. This means InChIs can be freely searched for via Google/Yahoo/Microsoft Live and other Internet search engines. (Web 2.0)

    6. The InChIKey unlocks the data and information from all sites around the world that choose to use it. The InChIKey allows commercial chemical information providers (e.g., Thieme, Elsevier, Thomson, Prous Science, and John Wiley) to have a free structure and number/linking system. (Web 2.0)


41
InChIKey & CAS RN – CAS features
  •  Will register any “chemical” - need not be a defined/definite structure


  • Charges a fee for a CAS RN or new CAS RN


  • Will not let people use CAS RN in large databases without a contract and  an ongoing fee


  •  Essentially (99+%) covers only the chemical literature -  CAS abstracts


  • CAS RN's are generated only at CAS


42
InChIKey & CAS RN – InChI features
  • Open Source
  • No cost to anyone (except labor for implementation)
  •  Key can be generated by anyone, anywhere
  •  Can be used internally and externally (Internet - web)
  • Very few InChIKeys are associated with the literature (<1%)
  • Is the only available structure representation that can be used world-wide in any database.
  •  InChIKey is created by database owner, not by IUPAC or a central service/source.
  • Based on the above it is easy to understand why it is likely that the InChIKey will be the globally accepted standard for defining and describing a defined chemical substance.



43
 Acknowledgments
  • Philip Abrahams, Steve Bachrach, Colin Batchelor, Ted Becker, Jost Bohlen, Pieter Bolman, Evan Bolton, Bob Bovenschulte, Steve Bryant, Harry Collier, Alice Cooper,  Nick Day, Rene Deplanque, Ron Dunn, Simon Quellen Field, Guenter Grethe, Stevan Harnad, Wolf-Dietrich Ihlenfeldt, Sami Kassab, Richard Kidd, Sandy Lawson, David Lipman, Gary Mallard, Randy Marcinko, Bill Milne, Carmen Nitsche, Josep Prous, Chris Reed, Rich Roberts, Peter Murray-Rust, Henry Rzepa,  Peter Shepherd, Bill Town, Andrea Twiss-Brooks, Wendy Warr, Tony Williams, and Ann Wolpert