InChI
The IUPAC International Chemical Identifier
InChI: NIST Parentage
NIST/EPA/NIH Mass Spectral Database (1976)
32,000 connection tables for DOS display (1986)
Structures & Properties (1990)
Reference Data & Structure-Based Property Prediction
NIST Chemistry WebBook (1991)
Proposed ‘Standard’ canonical connection table (1992)
Too early
NIST + IUPAC (2000)
2002-2005
Salts and H-migration
Organometallics
Protonation
Test/refinement
Final InChI structure
Formal Requirement 1
Different compounds have different identifiers
All distinguishing structural information is included
Formal Requirement 2
A compound has a single identifier
Include only necessary information
Practical Requirements
Represent common ‘equilibrated mixtures’
Handles incomplete stereochemistry
Deal with coordination/imprecise bonding
Suitable for individual and community
Two Problems
Chemicals
Equilibrated Mixtures
Tautomerization, protonation
Imprecise connectivity
Organometallics, boranes
Chemists
Differing conventions
Depends on discipline, education and convenience
Unrecognized uncertainty
‘Don’t know what they don’t know’
Graphics oriented
Chemical or Mixture?
A ‘Chemical’ is often an equilibrium mixture of a set of distinct entities
H2SO4 ⇌ H+ + HSO4– ⇌ 2 H+ + SO4^2–
RCO2H ⇌ H+ + RCO2–
Depends on chemical environment (pH, T, …)
Details often not known
Must Deal with Common Input Ambiguities
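The pH dependence noted above can be made concrete with a small calculation. This is only an illustrative sketch using the Henderson-Hasselbalch relation; the function name and the pKa of 4.76 (the commonly quoted value for acetic acid) are assumptions added here, not part of the slides.

```python
def fraction_deprotonated(pH: float, pKa: float) -> float:
    """Fraction of RCO2H present as RCO2- at a given pH (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (pKa - pH))


# Assumed example: a carboxylic acid with pKa = 4.76 (acetic acid).
for pH in (2.0, 4.76, 7.0):
    print(f"pH {pH}: {fraction_deprotonated(pH, 4.76):.3f} deprotonated")
```

The same drawn structure therefore stands for a mostly protonated species at low pH and a mostly deprotonated one at neutral pH, which is why an identifier has to treat variable protonation explicitly.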
3 Steps to InChI
Chemistry
‘Normalize’ Input Structure
Implement chemical rules
Math
‘Canonicalize’ (label the atoms)
Equivalent atoms get the same label
Format
‘Serialize’ Labeled Structure
Output as character string (‘name’)
Normalization
Divide input structure into ‘layers’
Each layer ‘refines’ structure
Ignore ‘Electron Density’
Use simple ‘connectivity’ only
Ignore bond type and charge location
Stereochemistry
sp2 and sp3 only
Free rotation around single bonds
Mostly Ignore Electron Density
Not required for compound identification
Different densities => different electronic states
Simplification
Delocalization, aromaticity, zwitterions, coordination …
Use input bonding for stereo/tautomer perception
When not otherwise specified
IUPAC International Chemical Identifier (InChI) – Technical Issues
InChI Scope
Discrete, covalently bonded compounds
Isotopes
Stereochemistry:
sp3 (tetrahedral)
sp2 (double bonds)
Tautomers
Variable Protonation
InChI Requirements: Unique Identifier
One Compound ↔ One Identifier:
Compound = Chemical structure
Identifier (InChI) = String of characters
Produce same InChI from various representations of the same compound
Possibility to restore a chemical structure from InChI
InChI Requirements: Layers to express the degree of details
Empirical formula
Molecular skeleton
Structure with mobile hydrogen atoms
Charge and proton balance
Structure with fixed mobile hydrogen atom locations
Isotopic composition
Stereochemistry for:
Mobile H
Fixed Mobile H
Mobile H, Isotopic
Fixed Mobile H, Isotopic
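A minimal sketch of how the layers listed above appear in an identifier string: each layer is a "/"-separated segment, and all segments after the formula start with a one-letter prefix. The function name `split_layers` and the wording of the layer labels are assumptions made for illustration; the example string is the one used in the CML slide of this deck.

```python
# Prefix letter -> layer description (labels are informal, chosen for this sketch).
LAYER_NAMES = {
    "c": "connections (skeleton)",
    "h": "hydrogen atoms (fixed and mobile)",
    "q": "charge",
    "p": "proton balance",
    "b": "double-bond (sp2) stereo",
    "t": "tetrahedral (sp3) stereo",
    "m": "stereo inversion marker",
    "s": "stereo type",
    "i": "isotopic",
    "f": "fixed-H",
    "r": "reconnected",
}


def split_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI string into (layer name, content) pairs, in order."""
    body = inchi.split("=", 1)[-1]          # drop an optional "InChI=" prefix
    version, *segments = body.split("/")
    layers = [("version", version)]
    for seg in segments:
        if seg and seg[0] in LAYER_NAMES:   # lowercase prefix marks a layer
            layers.append((LAYER_NAMES[seg[0]], seg[1:]))
        else:                               # the formula has no prefix letter
            layers.append(("formula", seg))
    return layers


example = "1/C14H22O2/c1-13(2,3)9-7-12(16)10(8-11(9)15)14(4,5)6/h1-6H3,7-8H,15-16H"
for name, content in split_layers(example):
    print(f"{name}: {content}")
```

This flat split ignores the nesting of sub-layers (for example, stereo segments that belong to the isotopic or fixed-H layers), so it is suitable for inspection rather than for rigorous comparison.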
Layered Identifier
Structure → InChI: Normalization-1
Discover redundant information
Reduce structure to common drawing conventions
Remove/add protons to make charge closer to zero
Discover possible locations of mobile charges
Discover possible positions of mobile H
Allow H, charges, and radicals to migrate to reduce the number of stereogenic elements
Structure → InChI: Normalization-2
Discard:
Electron density
Bond order
Charge/radical locations
Aromaticity
Excited states
Stereo information for
changeable bonds
bonds in rings with fewer than 8 members
Keep:
All atoms ± protons
Connections between atoms
Hydrogen atoms:
All Mobile H locations
Original H locations
Spatial arrangements around stereogenic bonds and atoms
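The discard/keep split above can be sketched with a toy data structure. The `Drawing` type and `skeleton` function are hypothetical names introduced for this illustration and are not part of the InChI software; the sketch also assumes both drawings number their atoms the same way, since renumbering is the job of canonicalization.

```python
from typing import NamedTuple


class Drawing(NamedTuple):
    elements: tuple[str, ...]                 # element symbol per atom index
    bonds: tuple[tuple[int, int, int], ...]   # (atom_i, atom_j, bond_order)
    charges: dict[int, int]                   # atom index -> formal charge


def skeleton(d: Drawing):
    """Keep atoms and connections; discard bond orders and charge locations."""
    edges = frozenset(frozenset((i, j)) for i, j, _order in d.bonds)
    return d.elements, edges


# Nitromethane heavy atoms (C, N, O, O) drawn with two common conventions.
charge_separated = Drawing(
    ("C", "N", "O", "O"), ((0, 1, 1), (1, 2, 2), (1, 3, 1)), {1: +1, 3: -1}
)
pentavalent = Drawing(
    ("C", "N", "O", "O"), ((0, 1, 1), (1, 2, 2), (1, 3, 2)), {}
)

print(skeleton(charge_separated) == skeleton(pentavalent))  # True
```

Once bond orders and charge positions are dropped, the two drawing conventions reduce to the same atoms and connections.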
Normalization Details-1
Reduction to common conventions
Normalization Details-2
Replacing charges with increased bond order
Normalization Details-3
Replacing charges with increased bond order
Fixing bonds to preserve functional groups - Example
Rules for fixing bonds
Rules for fixing bonds (cont.)
Cancel radicals
Remove charges by adding and/or removing protons
Aggressive (De)protonation
Tautomerism definitions
Testing bonds
Testing Mobile H – similar to testing bonds
Canonicalization
Obtaining canonical numbers of the atoms in the structure (based on a well-known graph algorithm by B. D. McKay, 1981 [6])
The numbers must be created in such a way that allows the layered structure of InChI:
Adding each next layer does not change the preceding layers
The identifier is unique
Simple Canonicalization Example
Canonicalization: the 1st colors
Canonicalization: equitable partition
Continue by using the new “colors” as 1st colors and calculating the next new colors until there are no more changes
If equal “colors” remain, take one atom from the smallest class of equal “colors”, reduce its color to (the nearest smaller color + 1), and use this set of colors as 1st colors; repeat until all colors are different. Save the colors and the connection table.
Repeat the reductions of equal colors until all sequences of color reductions have been explored
Keep the set of colors that provides the “smallest” connection table. These colors are the canonical numbers.
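The refinement and tie-breaking steps above can be sketched in a few lines. This is a simplified illustration, not McKay's algorithm with its pruning and not the InChI implementation: it explores every tie-breaking sequence exhaustively, so it is only practical for small structures. All function names, the choice of (element, degree) as the 1st colors, and the 0-based numbering are assumptions made here.

```python
def refine(adj, colors):
    """Split color classes by neighbor colors until the partition is equitable."""
    while True:
        keys = [(colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in range(len(adj))]
        order = sorted(set(keys))
        rank = {k: r for r, k in enumerate(order)}
        new_colors = [rank[k] for k in keys]
        if len(order) == len(set(colors)):   # no class was split: done
            return new_colors
        colors = new_colors


def connection_table(adj, numbering):
    """Sorted edge list written in terms of a trial numbering."""
    return sorted(tuple(sorted((numbering[v], numbering[u])))
                  for v in range(len(adj)) for u in adj[v] if v < u)


def canonical_numbering(adj, elements):
    """Return (numbering, connection table) with the smallest connection table."""
    best = None

    def search(colors):
        nonlocal best
        classes = {}
        for v, c in enumerate(colors):
            classes.setdefault(c, []).append(v)
        tied = sorted(c for c, members in classes.items() if len(members) > 1)
        if not tied:                          # all colors differ: a full numbering
            table = connection_table(adj, colors)
            if best is None or table < best[1]:
                best = (colors, table)
            return
        c = tied[0]                           # smallest class of equal colors
        for v in classes[c]:                  # try reducing each member in turn
            trial = [2 * x for x in colors]
            trial[v] = 2 * c - 1              # just below its former class
            search(refine(adj, trial))

    first = [(elements[v], len(adj[v])) for v in range(len(adj))]
    search(refine(adj, first))
    return best


# The 2-methylbutane carbon skeleton, entered with two different atom orders.
adj_a = [[1], [0, 2, 4], [1, 3], [2], [1]]
adj_b = [[1, 3, 4], [0, 2], [1], [0], [0]]
table_a = canonical_numbering(adj_a, "CCCCC")[1]
table_b = canonical_numbering(adj_b, "CCCCC")[1]
print(table_a == table_b)  # True
```

Because the 1st colors, the refinement, and the choice of which class to break depend only on the structure and not on the input atom order, both inputs reach the same minimal connection table.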
Canonicalization: find minimal connection table (shown 2 out of 16 possible numberings)
Canonicalization: find equivalent atoms and start over
Canonicalization: other layers
So far only the connections have been canonicalized. Each subsequent layer adds one more list, ordered by canonical numbers, that describes a structural feature:
H atoms in fixed positions
Groups of mobile H and attachment points
Isotopic “weights” of atoms
The minimization is repeated for each added layer; it leaves all previously minimized layers unchanged.
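One way to see why added layers cannot disturb earlier ones is lexicographic comparison: if each candidate numbering is scored by the tuple (connection table, layer 1 list, layer 2 list, ...), the minimum over that tuple always has the minimal connection table first. The sketch below is a brute-force illustration over all atom numberings, with hypothetical names; it is not how the InChI software performs the search.

```python
from itertools import permutations


def layered_minimum(adj, layer_values):
    """Smallest (connection table, layer lists...) over all atom numberings.

    layer_values holds one per-atom value list for each extra layer,
    for example H counts, in layer order.
    """
    n = len(adj)

    def key(numbering):
        table = sorted(tuple(sorted((numbering[v], numbering[u])))
                       for v in range(n) for u in adj[v] if v < u)
        extra = []
        for values in layer_values:
            ordered = [None] * n
            for v in range(n):
                ordered[numbering[v]] = values[v]   # list ordered by new number
            extra.append(ordered)
        return (table, *extra)

    return min(key(p) for p in permutations(range(n)))


# Three-atom chain 0-1-2 with an assumed per-atom H count as the extra layer.
adj = [[1], [0, 2], [1]]
connections_only = layered_minimum(adj, [])
with_h_layer = layered_minimum(adj, [[3, 2, 1]])
print(connections_only[0] == with_h_layer[0])  # True
print(with_h_layer[1])                         # [2, 1, 3]
```

The connection table is identical with and without the H layer; the extra layer only decides between numberings that were tied on the connections.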
Canonicalization: Stereochemistry
One more list containing triplets: ordered canonical numbers of atoms at the ends of stereogenic bonds and bond parity.
One more list containing ordered canonical numbers of stereogenic atoms and their parities.
A heuristic algorithm is included that recognizes most atoms and double bonds that are not stereogenic because of symmetry.
2-D Drawing correctness definitions (4 ligands)
2-D Drawing correctness definitions (3 ligands)
Stereochemistry Representation
InChI creation options
The options tell the InChI algorithm which information is known and should be included, and which is unknown or inessential and should be omitted
If a substance behaves as a mixture of tautomeric forms, do not include the Fixed-H layer
If stereochemistry is unknown but 3-D coordinates are provided, do not include the stereochemical layer
If the maximum level of detail is needed, include the Fixed-H, Stereo, and Reconnected layers
Make sure that 2D stereo drawing conventions are consistent with use of the NEWPS option (see the next slide)
If Absolute stereo is marked with the Chiral Flag, include the SUCF option
If all hydrogen atoms are included in the structure, use the DoNotAddH option
InChI creation options
InChI Stream Editor (SED) scripts
The InChI package includes a set of scripts designed to remove InChI layers to facilitate searching and comparison.
Replace Absolute stereo with Relative
Remove Fixed-H layer from InChI
Remove Isotopic layer from InChI
Remove Stereochemical layer from InChI
Remove "Include bonds to metal" layer from InChI
Remove sp3 Stereo layer from InChI
Remove Charge and Proton layers from InChI
Remove Disconnected structure info from InChI
Remove All Reversibility Info from AuxInfo
Remove Exact Reversibility Info from AuxInfo
Remove Mapping of canonical numbers from AuxInfo
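The idea behind these scripts can be illustrated in Python. The sketch below does not reproduce the distributed SED scripts; `drop_layers` is a hypothetical helper that removes every "/"-separated segment whose prefix letter is in a given set, and the two alanine strings are example inputs assumed here.

```python
def drop_layers(inchi: str, prefixes: str) -> str:
    """Remove all layer segments whose one-letter prefix is in `prefixes`."""
    head, *segments = inchi.split("/")
    kept = [s for s in segments if not (s and s[0] in prefixes)]
    return "/".join([head, *kept])


# Assumed example identifiers for L- and D-alanine (they differ only in /m).
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

# Stereo segments use the prefixes b, t, m and s.
print(drop_layers(l_alanine, "btms"))
print(drop_layers(l_alanine, "btms") == drop_layers(d_alanine, "btms"))  # True
```

With the stereochemical segments removed, the two enantiomers compare as equal, which is the kind of relaxed matching the scripts are meant to support. A flat filter like this does not distinguish main-layer stereo from stereo segments nested inside the isotopic or fixed-H layers, so it is a simplification.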
wInChI Program
wInChI Annotated Output
wInChI Annotated Output (continued)
wInChI Program Options
InChI validation protocol
Designed primarily to verify that
InChI code used in 3rd party applications works correctly
InChI code ported to another compiler or operating system works correctly
Contains
Structures that test most of the InChI logic described in the InChI documentation
InChIs generated from these structures
Validation instructions
(in preparation)
References
[1] S. Stein, S. Heller, D. Tchekhovskoi, "The IUPAC Chemical Identifier – Technical Manual", April 2005; http://www.iupac.org/inchi.
[2] "User’s Guide: IUPAC International Chemical Identifier (InChI) Program", April 2005; http://www.iupac.org/inchi.
[3] J. Mockus, R. E. Stobaugh, "The Chemical Abstracts Service Chemical Registry System. VII. Tautomerism and Alternating Bonds", J. Chem. Inf. Comput. Sci., 1980, vol. 20, pp. 18-22.
[4] J. E. Blackwood, P. E. Blower, Jr., S. W. Layten, D. H. Lillie, A. H. Lipkus, J. P. Peer, C. Qian, L. M. Staggenborg, C. E. Watson, "Chemical Abstracts Service Chemical Registry System. 13. Enhanced Handling of Stereochemistry", J. Chem. Inf. Comput. Sci., 1991, vol. 31, pp. 204-212.
[5] W. Kocay, D. Stone, "An Algorithm for Balanced Flows", The Journal of Combinatorial Mathematics and Combinatorial Computing, 1995, vol. 19, pp. 3-31.
[6] B. D. McKay, "Practical Graph Isomorphism", Congressus Numerantium, 1981, vol. 30, pp. 45-87.
[7] G. Butler, "Fundamental Algorithms for Permutation Groups", Berlin; New York: Springer-Verlag, 1991 (Series: Lecture Notes in Computer Science, 559), Chapter 11.
InChI availability
Standalone programs for Win32 and i386 Linux, including command line program
InChI Software library: for linking to user’s application through InChI API
Documentation, examples, SED scripts
Source code
Available for download at
http://www.iupac.org/inchi
Current Applications
NIH/NCBI – PubChem
NIH/NCI Database
EPA/DSSTox
NIST WebBook
ACD/ChemDraw
EBI/ChEBI (coming)
Others ….
CML & InChI
CML: a vehicle for chemical data transmission
InChI: a tag for chemical identification
<molecule><identifier>
<inchi>1/C14H22O2/c1-13(2,3)9-7-12(16)10(8-11(9)15)14(4,5)6/h1-6H3,7-8H,15-16H</inchi>
</identifier></molecule>
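A short sketch of reading the identifier back out of such a fragment with Python's standard XML parser. It assumes a well-formed fragment (with a closing `</identifier>` tag) and uses the element names exactly as they appear on this slide, which may not match the full CML schema.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the fragment shown on the slide.
cml = """<molecule><identifier>
<inchi>1/C14H22O2/c1-13(2,3)9-7-12(16)10(8-11(9)15)14(4,5)6/h1-6H3,7-8H,15-16H</inchi>
</identifier></molecule>"""

root = ET.fromstring(cml)                    # root element is <molecule>
inchi = root.findtext("identifier/inchi")    # text of the nested <inchi> element
formula = inchi.split("/")[1]                # second "/"-separated segment

print(inchi)
print(formula)  # C14H22O2
```

Here CML carries the data and the InChI string serves as the tag by which the molecule can be looked up or compared.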
Remaining Problem:
No Accepted Depiction ‘Standards’
Hashed wedge
Explicit H-representation
Relative stereo and d/l mixtures
Coordination compounds
Tautomerism, protonation
….
http://www.iupac.org/inchi
Nick Day’s InChI FAQ
Take Home
InChI represents isolated or H-equilibrated compounds.
Care needed when preparing InChI input
Legacy - switches or preprocessing
New – appropriate user controls
Layers allow structure detail, but add complexity
Matching compounds may require layer processing
Chemical Complexities
Stereochemistry, tautomerism/protonation, other equilibration, bonds to metals, mixtures