The IUPAC Chemical Identifier
Steve Stein, Steve Heller,
Dmitrii Tchekhovskoi
National Institute of Standards and Technology
Gaithersburg, MD, USA


 
U.S. Government Chemical Databases
Frederick, MD
July 22, 2003
Too Many Identifiers
Structure diagrams
various conventions
contain ‘too much’ information
Connection Tables
MolFiles, SMILES, ROSDAL, ...
Pronounceable names
IUPAC, CAS, trivial
Index Numbers
EINECS, FEMA, DOT, RTECS, CAS, Beilstein, USP, EEC, RCRA, NCI, UN, USAF
What kind of Identifier is needed?
Derived from structure by algorithm
Accepts common drawing conventions
Exactly one Identifier per structure
Comprehensive
Openly available
Requirements
Different compounds have different identifiers
All distinguishing structural information is included
Requirements
One compound has only one identifier
Include only necessary information
IChI
First Version
Discrete, bonded compounds
Include ‘dot disconnected’ compounds
Stereochemistry
sp3 - tetrahedral
Z/E - double bond
Tautomers
3 Steps to IChI
‘Normalize’ Input Structure
Defined input structure required
Remove conventions with chemical rules
Divide into ‘layers’
‘Canonicalize’ (label the atoms)
Equivalent atoms get the same label
‘Serialize’ the Labeled Structure
A unique series of bytes
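The three steps above can be sketched on a toy scale. This is not the IChI algorithm: it assumes atoms are indexed from 0, bonds are hypothetical `(atom, atom, order)` tuples, and it canonicalizes by brute force, trying every numbering and keeping the smallest serialized string. That is only practical for a handful of atoms, but it shows why any input numbering of the same structure ends up as the same series of bytes.

```python
from itertools import permutations

def normalize(bonds):
    """Step 1 (toy): drop bond orders, keep only which atoms are neighbours."""
    return {frozenset((a, b)) for a, b, _order in bonds}

def serialize(elements, edges):
    """Step 3 (toy): write a labeled structure as one string."""
    pairs = sorted(tuple(sorted(e)) for e in edges)
    return "".join(elements) + " " + " ".join(f"{a}-{b}" for a, b in pairs)

def canonical_string(elements, bonds):
    """Step 2 (toy): try every relabeling, keep the smallest serialization."""
    edges = normalize(bonds)
    n = len(elements)
    best = None
    for perm in permutations(range(n)):      # perm[old index] = new index
        relabeled = [None] * n
        for old, new in enumerate(perm):
            relabeled[new] = elements[old]
        # Labels in the output are 1-based, as on the slides
        new_edges = [frozenset(perm[a] + 1 for a in e) for e in edges]
        candidate = serialize(relabeled, new_edges)
        if best is None or candidate < best:
            best = candidate
    return best

# Ethanol heavy atoms drawn in two different atom orders, and dimethyl ether
ethanol_a = canonical_string(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_b = canonical_string(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
ether = canonical_string(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])
print(ethanol_a == ethanol_b, ethanol_a == ether)
```

The two ethanol drawings give the same string, and dimethyl ether gives a different one.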
NORMALIZATION
Simplifications
Ignore ‘Electron Density’
Double/Triple/Coordination bonds
Odd-electrons/Charges
Free Rotation Around Single Bonds
Separate structure information into ‘layers’
Ignore Electron Density
Not required for compound identification
Represent ‘excited states’
Simplify representations
Delocalization, aromaticity, zwitterions, coordination …
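A small sketch of why ignoring electron density simplifies representations: the two Kekulé drawings of benzene differ only in where the double bonds are placed, so once bond orders are dropped they reduce to the same connectivity. The `(atom, atom, order)` bond tuples are an assumed input format for illustration, not the IChI input format.

```python
def connectivity(bonds):
    """Keep only which atoms are bonded; discard the bond order."""
    return frozenset(frozenset((a, b)) for a, b, _order in bonds)

# Two Kekulé drawings of the benzene ring (carbons 1..6), orders alternate
kekule_a = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1)]
kekule_b = [(1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 6, 1), (6, 1, 2)]

same_drawing = set(kekule_a) == set(kekule_b)
same_connectivity = connectivity(kekule_a) == connectivity(kekule_b)
print(same_drawing, same_connectivity)
```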
Assume Free Rotation Around Single Bonds
LAYERS
Divide into ‘Layers’
Formula
Connectivity
Disconnect metals
Connect metals
Isotopes
Stereochemistry
Tautomers (on/off)
Basic Layer
Non-Metals Only
Just atoms and their neighbors
Ignore everything else
Non-Metals:
CHNO
SPSi
Halogens
Metals Layer
Bonds to ‘Metals’
All bonds to metals
Ignore bond type
Empty if no bonds to metals
Use when bonding is known and significant
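The split between the basic layer and the metals layer can be sketched as a simple partition of the bond list. The non-metal set below follows the slide (C, H, N, O, S, P, Si, halogens); treating every other element as a 'metal', the choice of halogen symbols, and the `split_layers` helper itself are assumptions made for illustration.

```python
# Non-metals as listed on the slide; the halogen symbols are filled in here
NON_METALS = {"C", "H", "N", "O", "S", "P", "Si", "F", "Cl", "Br", "I", "At"}

def split_layers(elements, bonds):
    """Partition bonds into (basic layer, metals layer).

    elements: dict mapping atom label -> element symbol
    bonds:    list of (label, label) pairs, bond type already ignored
    """
    basic, metal = [], []
    for a, b in bonds:
        if elements[a] in NON_METALS and elements[b] in NON_METALS:
            basic.append((a, b))
        else:
            metal.append((a, b))   # any bond touching a 'metal'
    return basic, metal

# Sodium acetate heavy atoms, drawn with an explicit Na-O bond
elements = {1: "C", 2: "C", 3: "O", 4: "O", 5: "Na"}
bonds = [(1, 2), (2, 3), (2, 4), (4, 5)]
basic_layer, metal_layer = split_layers(elements, bonds)
print(basic_layer, metal_layer)
```

With no bonds to metals in the input, the second list comes back empty, matching the 'Empty if no bonds to metals' rule.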
Isotopes
Stereochemistry
Double Bond (Z/E)
Coordinates
Defined parity
Tetrahedral (sp3)
‘in/out’ bonds
x,y,z coordinates
Defined parity
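One common way to derive a tetrahedral parity from x,y,z coordinates is the sign of the signed volume spanned by the four neighbours, taken in a fixed order such as their canonical labels. The sketch below uses that general technique; it is an assumption that this matches how IChI defines parity, and the function names are hypothetical.

```python
def signed_volume(p1, p2, p3, p4):
    """Determinant of (p1-p4, p2-p4, p3-p4): six times the signed tetrahedron volume."""
    ax, ay, az = (p1[i] - p4[i] for i in range(3))
    bx, by, bz = (p2[i] - p4[i] for i in range(3))
    cx, cy, cz = (p3[i] - p4[i] for i in range(3))
    return (ax * (by * cz - bz * cy)
            - ay * (bx * cz - bz * cx)
            + az * (bx * cy - by * cx))

def parity(neighbors, tol=1e-9):
    """+1 or -1 for the two arrangements, 0 if the neighbours are coplanar."""
    v = signed_volume(*neighbors)
    if abs(v) < tol:
        return 0
    return 1 if v > 0 else -1

# Four neighbours at alternate corners of a cube, listed in label order
center_neighbors = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
mirror_image = [(-x, y, z) for x, y, z in center_neighbors]
print(parity(center_neighbors), parity(mirror_image))
```

Reflecting the coordinates, or swapping two neighbours in the ordering, flips the sign, which is what distinguishes the two stereoisomers.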
Varieties of Double Bond Isomers
sp3 (tetrahedral)
stereoisomers
Identify Stereogenic Centers
Speed up processing
Helpful for chemists
Basic Tautomer Layer
Tautomers
‘Salt’ Tautomers
Electronic Layer
Net Charge
IChI Output
9 possible fields
Basic – non-metal ##
Metal ##
Isotopic ##
Stereo ##
Stereo ##
Tautomer – non-metal ##
Metal ##
Isotopic ##
Stereo ##
Stereo ##
Electronic ##
Output Format
     Example: Benzene

Represent each atom by its sequence number in the formula
     C6H6  =  C   C   C   C   C   C   H   H   H   H   H   H
     tags     1   2   3   4   5   6   7   8   9   10  11  12
     Basic Layer:
     <basic>C6H6 1-2-7 2-3-8 3-4-9 4-5-10 5-6-11 7-12</basic>
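The tagging step in the benzene example can be sketched as expanding the formula into one entry per atom and numbering the entries from 1. The `expand_formula` helper is hypothetical and handles only simple formulas such as C6H6 (element symbols with optional counts, no brackets or charges).

```python
import re

def expand_formula(formula):
    """Map sequence numbers (tags) to element symbols, in formula order."""
    atoms = []
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means a single atom of that element
        atoms.extend([symbol] * (int(count) if count else 1))
    return {tag: symbol for tag, symbol in enumerate(atoms, start=1)}

tags = expand_formula("C6H6")
print(tags)
```

For C6H6 this assigns tags 1 to 6 to the carbons and 7 to 12 to the hydrogens, as in the table above.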
Non-IChI Output
Information Only
For user verification
Label true stereogenic atoms
Identify equivalent atoms
Warnings
Unusual valences
Unrecognized input
‘Reversibility’ Information
Coordinates
Electron density
Positions of double/triple bonds, charges, odd electrons
Future Extensions
Other Stereo Forms
Non-atom centered
Conformations
Hydrogen Bonding
Polymers/Macromolecules
Compound Classes
Markush structures