1
How to link scientific, medical and healthcare chemical information using InChI and the InChIKey.

Stephen R. Heller
IUPAC
steve@hellers.com
2
The slides from this presentation can be found at:

http://www.hellers.com/steve/pub-talks/
(Microsoft April 2008 link)



The main web site for the IUPAC InChI project is:

http://www.iupac.org/inchi
4
The InChI Team
  • Stephen R. Heller
  • Alan McNaught
  • Igor Pletnev
  • Stephen E. Stein
  • Dmitrii Tchekhovskoi
5
Outline of Presentation
  • 1. Background/History/Objective
  • 2. Development of InChI
  • 3. InChIKey
  • 4. InChI Adoption & Use
  • 5. Conclusion
6
Objective from the Health & Life Sciences perspective
7
     Objective of the InChI Project

The objective of the IUPAC Chemical Identifier Project is to create a unique label, the IUPAC Chemical Identifier (InChI): an Open Source, Open Standard, non-proprietary identifier for drugs (chemical substances) that can be used in printed and electronic data sources, thus enabling easier linking of diverse data and information compilations, and helping to improve and save lives.
9
InChI
  • Using an InChI/InChIKey, you know you will find a match if it is there, and you need not worry that it was coded differently by another person or program.


  •    The InChI/InChIKey is a system for both public and private (fee-based) sources.
10
InChI
  • Using InChI means you can freely exchange and/or link structure files with others within your organization, and with any person or organization anywhere in the world, knowing that the structure's name, the InChI/InChIKey, will be the same. You can search for the InChI/InChIKey on the Internet using Microsoft Live Search and other search engines (e.g., Yahoo and Google).
13
InChI Characteristics
  • 1. Easy to generate (It will use existing software.)
  • 2. Expressive (It will contain structural information.)
  • 3. Unique/Unambiguous
  • 4. Easy to search for a structure via Internet search engines (Google, Yahoo, Microsoft Live, etc.) using the InChIKey (a hash of the InChI).
15
IUPAC Goal
  • Cooperation with, and support of, the chemical, biochemical, pharmaceutical, life sciences, and healthcare communities and scientific publishers in the development and use of the InChI/InChIKey
16
InChI is an agent of cooperation and change
23
InChIKey
  • The InChI string has been found to be too long for Internet search engines to use, hence the need for a fixed-length InChIKey. The InChIKey is a 25-character hash code of the InChI string (14 + 8 = 22 hash characters, plus 1 flag character, 1 check character, and 1 dash). It is made up of four (4) parts:


  •     AAAAAAAAAAAAAA-BBBBBBBBCD


  • A: 14 characters for the basic structure
  • B: 8 characters for the layers
  • C: 1 character is a flag indicating certain features (e.g., fixed or not fixed hydrogen atoms)
  • D: 1 character is a "check" character


  • A hash code is a fixed length condensed digital representation of a variable character string.


  • The InChIKey is based on a truncated SHA-256 cryptographic hash function.
  •    (http://en.wikipedia.org/wiki/SHA-2)
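The fixed-length property of a hash code can be illustrated with a minimal Python sketch. It is not the InChIKey algorithm: the real key maps the truncated SHA-256 hash to uppercase letters using its own encoding, which is not reproduced here. The function name `fixed_length_digest` and the 16-character truncation are assumptions made only for illustration; the two InChI strings are the caffeine and D-fructose examples from this talk.

```python
import hashlib

# InChI strings taken from the caffeine and D-fructose slides of this talk.
CAFFEINE = "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
D_FRUCTOSE = ("InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8"
              "/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m1/s1")


def fixed_length_digest(text: str, n_hex: int = 16) -> str:
    """Return the first n_hex hex characters of the SHA-256 digest of text.

    Hypothetical helper: it only demonstrates truncating a cryptographic
    hash to a fixed length. It does NOT produce an InChIKey.
    """
    return hashlib.sha256(text.encode("ascii")).hexdigest()[:n_hex]


if __name__ == "__main__":
    for inchi in (CAFFEINE, D_FRUCTOSE):
        digest = fixed_length_digest(inchi)
        # Input lengths differ, digest lengths do not.
        print(len(inchi), len(digest))
```

Whatever the length of the input string, the truncated digest always has the same length, which is what makes such a code suitable for fixed-length fields and for search engines.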



24
InChIKey
  • The principal new features of the InChIKey are:


  • A fixed-length (25-character) condensed digital representation of the Identifier, to be known as InChIKey. In particular, this will:


  • * facilitate web searching, previously complicated by unpredictable breaking of InChI character strings by search engines


  • * allow development of a web-based InChI lookup service


  • * permit an InChI representation to be stored in fixed length fields


  • * make chemical structure database indexing easier


  • * allow verification of InChI strings after network transmission.


25
"Caffeine"
  • Caffeine:


  • InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3


  • InChIKey=RYYVLZVUVIJVGH-UHFFFAOYAW



  • First block (14 letters), encodes molecular skeleton (connectivity): RYYVLZVUVIJVGH


  • Second block (8 letters), encodes proton positions (tautomers), stereochemistry, isotopes, reconnected layer: UHFFFAOY


  • Flag character, indicates InChI version, presence/absence of fixed H layer, isotopes, and stereochemistry: A


  • Check character: W
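The breakdown of the caffeine key above can be sketched in Python. This is a minimal illustration that assumes the 25-character layout shown in this talk (14 letters, a dash, 8 letters, one flag letter, one check letter). The names `InChIKeyParts` and `split_inchikey` are hypothetical, and the sketch only splits the key: it does not verify the check character or decode the flag.

```python
import re
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    skeleton: str  # first block: connectivity
    layers: str    # second block: tautomers, stereo, isotopes, ...
    flag: str      # version / fixed-H / isotope / stereo flag
    check: str     # check character


# Assumed layout from the slides: 14 letters, dash, 8 + 1 + 1 letters.
_KEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}")


def split_inchikey(key: str) -> InChIKeyParts:
    """Split a 25-character InChIKey into its four parts."""
    prefix = "InChIKey="
    if key.startswith(prefix):
        key = key[len(prefix):]
    if not _KEY_PATTERN.fullmatch(key):
        raise ValueError(f"not a 25-character InChIKey: {key!r}")
    return InChIKeyParts(key[:14], key[15:23], key[23], key[24])


if __name__ == "__main__":
    parts = split_inchikey("InChIKey=RYYVLZVUVIJVGH-UHFFFAOYAW")
    print(parts.skeleton)  # RYYVLZVUVIJVGH
    print(parts.layers)    # UHFFFAOY
    print(parts.flag)      # A
    print(parts.check)     # W
```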


26
Really long InChI (Palytoxin)
27
D-Fructose (natural)

InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m1/s1

InChIKey=BJHIKXHVCXFQLS-UYFOZJQFBH


L-Fructose

InChI=1/C6H12O6/c7-1-3(9)5(11)6(12)4(10)2-8/h3,5-9,11-12H,1-2H2/t3-,5-,6-/m0/s1

InChIKey=BJHIKXHVCXFQLS-FUTKDDECBR
28

Fructose
D-Fructose
InChIKey: BJHIKXHVCXFQLS-UYFOZJQFBH

Fructose
L-Fructose
InChIKey: BJHIKXHVCXFQLS-FUTKDDECBR


Note: First 14 characters of BOTH InChIKeys are the SAME!

The 1st block (14) encodes the connectivity.
The 2nd block (8) encodes proton positions (tautomers), stereochemistry, isotopes, etc.
Check Character
Flag Character – InChI version, presence/absence of stereo info/isotopes, etc.
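Because the first block depends only on connectivity, keys can be grouped by their first 14 characters to find stereoisomers and other variants of the same skeleton. The following Python sketch illustrates this with the three keys shown in this talk; the function name `group_by_skeleton` is an assumption made for the example, not part of any InChI software.

```python
from collections import defaultdict
from typing import Dict, List

# InChIKeys as shown on the slides of this talk.
KEYS = {
    "D-Fructose": "BJHIKXHVCXFQLS-UYFOZJQFBH",
    "L-Fructose": "BJHIKXHVCXFQLS-FUTKDDECBR",
    "Caffeine": "RYYVLZVUVIJVGH-UHFFFAOYAW",
}


def group_by_skeleton(keys: Dict[str, str]) -> Dict[str, List[str]]:
    """Group compound names by the first (connectivity) block of their key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, key in keys.items():
        first_block = key.split("-", 1)[0]
        groups[first_block].append(name)
    return dict(groups)


if __name__ == "__main__":
    for block, names in group_by_skeleton(KEYS).items():
        print(block, names)
```

Searching on the first block alone therefore retrieves both D- and L-fructose, while the full key distinguishes them.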
29
Stereoisomers of menthol
31
InChI URLs
Main IUPAC InChI page: http://iupac.org/inchi/

InChI Google video lecture (11/06):
http://video.google.com/videoplay?docid=-6653695245776470969&q=heller+chemical

InChI Google video lecture (10/07): http://youtube.com/watch?v=F9XppyZg4E4

B. Kosata (Prague):
www.inchi.info

P. Murray-Rust/Nick Day (Cambridge): http://wwmm.ch.cam.ac.uk/inchifaq/

ChemSpider:
http://www.chemspider.com/inchi.asmx
32
Summary - Overall Features of InChI (1)
  • 1. InChI is the only publicly available method for creating a unique chemical identifier for a given chemical structure. In addition, InChI has a number of other valuable attributes, noted below.

  • 2. InChI is free, open source software.

  • 3. Any organization (public or private) can use it for internal and/or external structure files at no cost.


33
Summary - Overall Features of InChI (2)
  • 4. It is sponsored by IUPAC and primarily implemented by the US scientific standards agency, NIST.

  • 5. It allows the scientific, medical, and healthcare community to use the InChIKey as a universal chemical identifier. This means InChIs can be freely searched for via Internet search engines.

  • 6. The InChIKey unlocks the data and information from all sites around the world that choose to use it. The InChIKey gives commercial chemical information providers (e.g., Thieme, Elsevier, Thomson, Prous Science, and John Wiley) a free structure numbering and linking system.


34
 Acknowledgments
  • Philip Abrahams, Steve Bachrach, Colin Batchelor, Ted Becker, Jost Bohlen, Pieter Bolman, Evan Bolton, Steve Bryant, Harry Collier, Alice Cooper,  Nick Day, Rene Deplanque, Ron Dunn, Simon Quellen Field, Guenter Grethe, Wolf-Dietrich Ihlenfeldt, Sami Kassab, Richard Kidd, Sandy Lawson, David Lipman, Gary Mallard, Randy Marcinko, Bill Milne, Carmen Nitsche, Rudy Potenzone, Josep Prous, Chris Reed, Rich Roberts, Peter Murray-Rust, Henry Rzepa,  Peter Shepherd, Bill Town,  Wendy Warr, Tony Williams, and Ann Wolpert