NIST/EPA/NIH Mass Spectral Database (1976) |
|
32,000 connection tables for DOS display (1986) |
|
|
|
Structures & Properties (1990) |
|
Reference Data & Structure-Based Property
Prediction |
|
|
|
NIST Chemistry Webbook (1991) |
|
|
|
Proposed ‘Standard’ canonical connection table
(1992) |
|
Too early |
|
|
|
NIST + IUPAC (2000) |
|
|
Salts and H-migration |
|
Organometallics |
|
Protonation |
|
Test/refinement |
|
Final InChI structure |
|
|
|
|
|
|
|
Different compounds have different identifiers |
|
All distinguishing structural information is
included |
|
|
|
|
|
A compound has a single identifier |
|
Include only necessary information |
|
|
|
|
Represent common ‘equilibrated mixtures’ |
|
|
|
Handles incomplete stereochemistry |
|
|
|
Deal with coordination/imprecise bonding |
|
|
|
Suitable for individual and community |
|
|
|
|
|
|
Chemicals |
|
Equilibrated Mixtures |
|
Tautomerization, protonation |
|
Imprecise connectivity |
|
Organometallics, boranes |
|
|
|
Chemists |
|
Differing conventions |
|
Depends on discipline, education and convenience |
|
Unrecognized uncertainty |
|
‘Don’t know what they don’t know’ |
|
Graphics oriented |
|
|
|
|
|
A ‘Chemical’ is often an equilibrium mixture of
a set of distinct entities |
|
|
|
H₂SO₄ ⇌ H⁺ + HSO₄⁻ ⇌ 2H⁺ + SO₄²⁻ |
|
RCO₂H ⇌ H⁺ + RCO₂⁻ |
|
|
|
Depends on chemical environment (pH, T, …) |
|
Details often not known |
|
|
|
|
|
|
|
|
Chemistry |
|
‘Normalize’ Input Structure |
|
Implement chemical rules |
|
|
|
Math |
|
‘Canonicalize’ (label the atoms) |
|
Equivalent atoms get the same label |
|
|
|
Format |
|
‘Serialize’ Labeled Structure |
|
Output as character string (‘name’) |
|
|
|
|
|
Divide input structure into ‘layers’ |
|
Each layer ‘refines’ structure |
|
|
|
Ignore ‘Electron Density’ |
|
Use simple ‘connectivity’ only |
|
Ignore bond type and charge location |
|
|
|
Stereochemistry |
|
sp2 and sp3 only |
|
Free rotation around single bonds |
|
|
|
|
|
Not required for compound identification |
|
Different densities => different electronic
states |
|
Simplification |
|
Delocalization, aromaticity, zwitterions,
coordination … |
|
Use input bonding for stereo/tautomer perception |
|
When not otherwise specified |
|
|
|
|
|
|
|
|
|
Discrete, covalently bonded compounds |
|
Isotopes |
|
Stereochemistry: |
|
sp3 (tetrahedral) |
|
sp2 (double bonds) |
|
Tautomers |
|
Variable Protonation |
|
|
|
|
One Compound ↔ One Identifier: |
|
Compound = Chemical structure |
|
Identifier (InChI) = String of characters |
|
|
|
Produce same InChI from various representations
of the same compound |
|
Possibility to restore a chemical structure from
InChI |
|
|
|
|
|
Empirical formula |
|
Molecular skeleton |
|
Structure with mobile hydrogen atoms |
|
Charge and proton balance |
|
Structure with fixed mobile hydrogen atom
locations |
|
Isotopic composition |
|
Stereochemistry for: |
|
Mobile H |
|
Fixed Mobile H |
|
Mobile H, Isotopic |
|
Fixed Mobile H, Isotopic |
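
The layer list above maps onto slash-separated segments of the identifier string. The following is a minimal sketch of splitting an InChI into those segments. The helper name `split_inchi` and the `LAYER_NAMES` table are illustrative assumptions, not part of the InChI software; the single-letter prefixes follow the published InChI format as commonly documented, and the example string is the one used in the CML slide of this deck.

```python
# Hypothetical helper for illustration; not part of the InChI package.
# The prefix letters are assumed from the public InChI format description.
LAYER_NAMES = {
    "c": "connections (molecular skeleton)",
    "h": "hydrogen atoms (fixed and mobile)",
    "q": "charge",
    "p": "proton balance",
    "b": "double-bond (sp2) stereo",
    "t": "tetrahedral (sp3) stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopic composition",
    "f": "fixed-H block",
    "r": "reconnected block",
}


def split_inchi(inchi):
    """Return (version, formula, [(prefix, body), ...]) for an InChI string."""
    text = inchi[len("InChI="):] if inchi.startswith("InChI=") else inchi
    parts = text.split("/")
    version = parts[0]
    formula = parts[1] if len(parts) > 1 else ""
    # Every remaining segment starts with a one-letter layer prefix.
    layers = [(seg[0], seg[1:]) for seg in parts[2:] if seg]
    return version, formula, layers


example = "1/C14H22O2/c1-13(2,3)9-7-12(16)10(8-11(9)15)14(4,5)6/h1-6H3,7-8H,15-16H"
version, formula, layers = split_inchi(example)
print("version:", version, "formula:", formula)
for prefix, body in layers:
    print(prefix, "=", LAYER_NAMES.get(prefix, "unknown layer"), ":", body)
```

Because each layer only refines the ones before it, dropping trailing segments yields a coarser but still valid description of the same compound.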
|
|
|
|
|
|
|
|
Reduce structure to common drawing conventions |
|
Remove/add protons to make charge closer to zero |
|
Discover possible locations of mobile charges |
|
Discover possible positions of mobile H |
|
Allow H, charges, and radicals to migrate to
reduce the number of stereogenic elements |
|
|
|
|
|
Discard: |
|
Electron density |
|
Bond order |
|
Charge/radical locations |
|
Aromaticity |
|
Excited states |
|
Stereo information for |
|
changeable bonds |
|
bonds in rings with fewer than 8 members |
|
Keep: |
|
All atoms ± protons |
|
Connections between atoms |
|
Hydrogen atoms: |
|
All Mobile H locations |
|
Original H locations |
|
Spatial arrangements around stereogenic bonds
and atoms |
|
|
|
Obtaining canonical numbers of the atoms in the
structure (based on a well-known graph algorithm by B. D. McKay, 1981 [6]) |
|
The numbers must be generated in a way that
preserves the layered structure of InChI: |
|
Adding each next layer does not change the
preceding layers |
|
The identifier is unique |
|
|
|
|
|
|
|
|
|
|
Continue by using new “colors” as 1st
colors and calculate next new colors until there are no more changes |
|
If there are still equal “colors” then reduce
one of the smallest equal “colors” to the first color less than it + 1 and
use this set of colors as 1st colors, etc., until all colors are
different. Save the colors and the connection table. |
|
Repeat reductions of equal colors until all
sequences of color reductions have been explored |
|
Keep the set of colors that provides the
“smallest” connection table. These colors are canonical numbers. |
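
The steps above can be sketched in a few lines of Python. This is a simplified illustration of the general idea (refine colors, break ties one atom at a time, explore every tie-breaking sequence, keep the smallest connection table), not the InChI implementation and not McKay's algorithm in full: it omits automorphism pruning, so it is only practical for small graphs. The function names, the initial coloring by (element, degree), and the tie-breaking trick of doubling the colors are all assumptions made for this sketch.

```python
def refine(adj, colors):
    """Recompute colors from (own color, sorted neighbor colors) until stable."""
    while True:
        keys = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        rank = {k: i for i, k in enumerate(sorted(set(keys.values())))}
        new = {v: rank[keys[v]] for v in adj}
        # Refinement can only split classes; same count means no more changes.
        if len(set(new.values())) == len(set(colors.values())):
            return new
        colors = new


def canonical_numbering(atoms, bonds):
    """atoms: {id: element}; bonds: [(id, id)]. Returns (numbers, table)."""
    adj = {v: set() for v in atoms}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    start = {v: (atoms[v], len(adj[v])) for v in atoms}
    rank = {k: i for i, k in enumerate(sorted(set(start.values())))}
    stack = [refine(adj, {v: rank[start[v]] for v in atoms})]
    best = None
    while stack:
        colors = stack.pop()
        classes = {}
        for v, c in colors.items():
            classes.setdefault(c, []).append(v)
        tied = [vs for vs in classes.values() if len(vs) > 1]
        if not tied:
            # All colors differ: build the connection table for this numbering.
            table = sorted(tuple(sorted((colors[a], colors[b])))
                           for a, b in bonds)
            if best is None or table < best[0]:
                best = (table, colors)
            continue
        # Break a tie in the smallest tied class, trying each member in turn.
        target = min(tied, key=lambda vs: (len(vs), colors[vs[0]]))
        for v in target:
            split = {u: 2 * c + 1 for u, c in colors.items()}
            split[v] -= 1  # v now sorts just before its former class-mates
            stack.append(refine(adj, split))
    table, colors = best
    return {v: c + 1 for v, c in colors.items()}, table


# A six-membered carbon ring: every atom starts with the same color.
ring_atoms = {i: "C" for i in range(6)}
ring_bonds = [(i, (i + 1) % 6) for i in range(6)]
numbers, table = canonical_numbering(ring_atoms, ring_bonds)
print(numbers)
print(table)
```

The connection table returned here does not depend on how the input atoms were labeled, which is the property the canonical numbers are meant to guarantee.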
|
|
|
So far only the connections have been
canonicalized. Each subsequent layer requires one more list, ordered by
canonical numbers, for an additional structural feature: |
|
Atoms H in fixed positions |
|
Groups of mobile H and attachment points |
|
Isotopic “weights” of atoms |
|
The minimization is repeated; it leaves all
previously minimized layers unchanged and adds one more |
|
|
|
|
One more list containing triplets: ordered
canonical numbers of atoms at the ends of stereogenic bonds and bond
parity. |
|
One more list containing ordered canonical
numbers of stereogenic atoms and their parities. |
|
A heuristic algorithm is included that
recognizes most atoms and double bonds that are not stereogenic because of
symmetry. |
|
|
|
|
The options tell the InChI algorithm which
known information to include and which unknown or inessential information
to omit |
|
If a substance behaves as a mixture of
tautomeric forms then do not include Fixed-H layer |
|
If stereochemistry is unknown but 3-D
coordinates are provided then do not include the stereochemical layer |
|
If the maximum level of detail is needed,
include the Fixed-H, Stereo, and Reconnected layers |
|
Make sure that 2D stereo drawing conventions are
consistent with use of NEWPS option (see the next slide) |
|
If Absolute stereo is marked with the Chiral
Flag then include the SUCF option |
|
If all hydrogen atoms are included in the
structure then use DoNotAddH option |
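
The guidance above can be written down as a small decision helper. This is a sketch only: `StructureInfo` and `choose_options` are hypothetical names, and the code does not call the InChI program. The option names NEWPS, SUCF, and DoNotAddH come from this slide; FixedH, RecMet, and SNon are assumed here from the InChI User's Guide and should be checked against the documentation of the version in use. Options are returned without the platform-specific prefix character.

```python
from dataclasses import dataclass


@dataclass
class StructureInfo:
    """Hypothetical description of what is known about an input structure."""
    tautomeric_mixture: bool = True       # substance behaves as a tautomer mix
    stereo_unknown: bool = False          # coordinates present, stereo not known
    max_detail: bool = False              # caller wants every available layer
    narrow_end_points_to_center: bool = False  # drawing matches NEWPS convention
    chiral_flag_marks_absolute: bool = False
    all_h_explicit: bool = False


def choose_options(info):
    """Translate the slide's rules into a list of option names (no prefix)."""
    opts = []
    # Fixed-H is omitted for tautomeric mixtures unless full detail is requested.
    if info.max_detail or not info.tautomeric_mixture:
        opts.append("FixedH")      # assumed option name
    if info.max_detail:
        opts.append("RecMet")      # assumed option name (reconnected layer)
    if info.stereo_unknown:
        opts.append("SNon")        # assumed option name (omit stereo layer)
    if info.narrow_end_points_to_center:
        opts.append("NEWPS")
    if info.chiral_flag_marks_absolute:
        opts.append("SUCF")
    if info.all_h_explicit:
        opts.append("DoNotAddH")
    return opts


print(choose_options(StructureInfo(max_detail=True, all_h_explicit=True)))
```

Keeping the rules in one function makes it easier to apply the same choices consistently across a whole collection of structures.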
|
|
|
|
|
|
The InChI package includes a set of scripts
designed to remove InChI layers to facilitate searching and comparison. |
|
|
|
|
|
|
Replace Absolute stereo with Relative |
|
Remove Fixed-H layer from InChI |
|
Remove Isotopic layer from InChI |
|
Remove Stereochemical layer from InChI |
|
Remove "Include bonds to metal" layer
from InChI |
|
Remove sp3 Stereo layer from InChI |
|
Remove Charge and Proton layers from InChI |
|
Remove Disconnected structure info from InChI |
|
Remove All Reversibility Info from AuxInfo |
|
Remove Exact Reversibility Info from AuxInfo |
|
Remove Mapping of canonical numbers from AuxInfo |
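
The distributed scripts are SED scripts; the idea behind the simpler ones can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of those scripts: the function names are hypothetical, the layer prefixes (b, t, m, s for stereo; q, p for charge and protons) are assumed from the public InChI format description, and the example string is assumed to be the version 1 identifier of L-alanine. Removing the Fixed-H, isotopic, or reconnected information means dropping a whole block of segments and is not covered here.

```python
def remove_segments(inchi, prefixes):
    """Drop every layer segment whose one-letter prefix is in `prefixes`.

    The first two slash-separated parts (version and formula) are always kept.
    """
    parts = inchi.split("/")
    kept = parts[:2] + [p for p in parts[2:] if p and p[0] not in prefixes]
    return "/".join(kept)


def remove_stereo(inchi):
    """Remove all stereochemical segments (sp2 and sp3)."""
    return remove_segments(inchi, "btms")


def remove_sp3_stereo(inchi):
    """Remove only the tetrahedral stereo segments and their flags."""
    return remove_segments(inchi, "tms")


def remove_charge_and_protons(inchi):
    """Remove the charge and proton-balance segments."""
    return remove_segments(inchi, "qp")


# Assumed example: L-alanine with a tetrahedral stereo layer.
alanine = "1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
print(remove_stereo(alanine))
```

Two records can then be compared at a chosen level of detail by stripping the same layers from both identifiers before testing the strings for equality.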
|
|
|
Designed primarily to verify that |
|
InChI code used in 3rd party
applications works correctly |
|
InChI code ported to another compiler or
operating system works correctly |
|
Contains |
|
Structures that test most of the InChI logic
described in the InChI documentation |
|
InChIs produced from these structures |
|
Validation instructions |
|
(in preparation) |
|
|
|
|
[1] S. Stein, S. Heller, D. Tchekhovskoi, "The
IUPAC Chemical Identifier – Technical Manual", April 2005;
http://www.iupac.org/inchi. |
|
[2] "User’s Guide: IUPAC International Chemical
Identifier (InChI) Program", April 2005; http://www.iupac.org/inchi. |
|
[3] J. Mockus, R. E. Stobaugh, "The Chemical
Abstracts Service Chemical Registry System. VII. Tautomerism and
Alternating Bonds", J. Chem. Inf. Comput. Sci., 1980, vol. 20, pp. 18-22. |
|
[4] J. E. Blackwood, P. E. Blower, Jr., S. W. Layten,
D. H. Lillie, A. H. Lipkus, J. P. Peer, C. Qian, L. M. Staggenborg,
C. E. Watson, "Chemical Abstracts Service Chemical Registry
System. 13. Enhanced Handling of Stereochemistry", J. Chem. Inf.
Comput. Sci., 1991, vol. 31, pp. 204-212. |
|
[5] W. Kocay, D. Stone, "An Algorithm for
Balanced Flows", The Journal of Combinatorial Mathematics and
Combinatorial Computing, 1995, vol. 19, pp. 3-31. |
|
[6] B. D. McKay, "Practical Graph
Isomorphism", Congressus Numerantium, 1981, vol. 30, pp. 45-87. |
|
[7] G. Butler, "Fundamental Algorithms for
Permutation Groups", Berlin; New York: Springer-Verlag, 1991 (Series:
Lecture Notes in Computer Science, 559), Chapter 11. |
|
|
|
|
Standalone programs for Win32 and i386 Linux,
including command line program |
|
InChI Software library: for linking to user’s
application through InChI API |
|
Documentation, examples, SED scripts |
|
Source code |
|
Available for download at
http://www.iupac.org/inchi |
|
|
|
|
NIH/NCBI – PubChem |
|
NIH/NCI Database |
|
EPA/DSSTox |
|
NIST Webbook |
|
ACD/ChemDraw |
|
EBI/ChEBI (coming) |
|
Others …. |
|
|
|
|
|
|
CML: a vehicle for chemical data transmission |
|
InChI: a tag for chemical identification |
|
<molecule><identifier> |
|
<inchi>1/C14H22O2/c1-13(2,3)9-7-12(16)10(8-11(9)15)14(4,5)6/h1-6H3,7-8H,15-16H</inchi> |
|
</identifier></molecule> |
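
As a rough illustration of how an application might read the tag, the sketch below pulls the identifier out of a fragment shaped like the one on this slide, using Python's standard `xml.etree.ElementTree`. The element names `molecule`, `identifier`, and `inchi` are taken from the slide, with the closing `identifier` tag assumed to be well-formed. Real CML documents use namespaces and a richer schema, so the function name and the lookup path are assumptions for this simplified case.

```python
import xml.etree.ElementTree as ET

# Simplified fragment modeled on the slide; not a complete CML document.
fragment = """<molecule><identifier>
<inchi>1/C14H22O2/c1-13(2,3)9-7-12(16)10(8-11(9)15)14(4,5)6/h1-6H3,7-8H,15-16H</inchi>
</identifier></molecule>"""


def inchi_from_fragment(xml_text):
    """Return the text of the first identifier/inchi element, or None."""
    root = ET.fromstring(xml_text)          # root is the <molecule> element
    node = root.find("./identifier/inchi")  # path relative to <molecule>
    if node is None or node.text is None:
        return None
    return node.text.strip()


print(inchi_from_fragment(fragment))
```

The surrounding markup carries the data, while the InChI string alone is enough to decide whether two records describe the same compound.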
|
|
|
|
Hashed wedge |
|
Explicit H-representation |
|
Relative stereo and d/l mixtures |
|
Coordination compounds |
|
Tautomerism, protonation |
|
…. |
|
|
|
InChI represents isolated or H-equilibrated
compounds. |
|
Care needed when preparing InChI input |
|
Legacy: switches or preprocessing |
|
New: appropriate user controls |
|
|
|
Layers allow structure detail, but add
complexity |
|
Matching compounds may require layer processing |
|
|
|
Chemical Complexities |
|
Stereochemistry, tautomerism/protonation, other
equilibration, bonds to metals, mixtures |
|