Open Standards for Chemical Information: The IUPAC Chemical Identifier and Data Dictionary Projects
S. E. Stein, S. R. Heller, D. V. Tchekhovskoi
Physical & Chemical Properties Division
NIST
Digital Representation of Compounds
Chemical structure is the true ‘identifier’
But structures are not unique or convenient for computers
Convert structure (connection table) to unique string of characters by algorithms
The Iupac CHemical Identifier (IChI)
Design ‘Philosophy’
Originators
Capable of full representation
Eliminate ‘conventions’
Maximum application
‘Clients’
Robust
Selectable specificity
Software
Implement through external structure processing software
Two Problems
Chemicals
Rapid reaction (tautomerization)
Ambiguous/uncertain structure
Chemists
Differing conventions
Based on discipline, education and convenience
3 Steps to IChI
Chemistry
‘Normalize’ Input Structure
Implement chemical rules
Math
‘Canonicalize’ (label the atoms)
Equivalent atoms get the same label
Convention
‘Serialize’ the Labeled Structure
Output as a Series of Bytes
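The three steps above can be sketched on a toy scale. This is not the IChI algorithm: the function names, the `/` separator, and the brute-force labeling are assumptions made for illustration, and the approach is only practical for a handful of atoms. It shows the idea that two different atom numberings of the same connection table end up as one string.

```python
from itertools import permutations

def normalize(atoms, bonds):
    """Chemistry step (toy): drop bond orders, keep only which atoms are connected."""
    edges = {frozenset((a, b)) for a, b, _order in bonds}
    return list(atoms), edges

def serialize(atoms, edges, order):
    """Convention step (toy): atoms in the given order, then bonds as 1-based index pairs."""
    new_index = {old: new for new, old in enumerate(order, start=1)}
    atom_part = "".join(atoms[old] for old in order)
    pairs = sorted(tuple(sorted(new_index[i] for i in edge)) for edge in edges)
    bond_part = ",".join(f"{a}-{b}" for a, b in pairs)
    return f"{atom_part}/{bond_part}"

def canonicalize(atoms, edges):
    """Math step (toy): try every numbering and keep the one whose string sorts first."""
    return min(
        permutations(range(len(atoms))),
        key=lambda order: serialize(atoms, edges, order),
    )

def toy_identifier(atoms, bonds):
    atoms, edges = normalize(atoms, bonds)
    order = canonicalize(atoms, edges)
    return serialize(atoms, edges, order)

# Heavy-atom skeleton C-C-O drawn with two different atom numberings.
first = toy_identifier(["C", "C", "O"], [(0, 1, 1), (1, 2, 2)])
second = toy_identifier(["O", "C", "C"], [(0, 1, 2), (1, 2, 1)])
print(first == second)
```

Because every numbering is tried, the result does not depend on how the input was drawn; a real implementation would replace the brute-force search with an efficient canonical labeling method.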
Normalization:
Simplifications
Ignore ‘Electron Density’
Double/Triple/Coordination bonds
Odd-electrons/Charges
Stereochemistry
Free rotation around single bonds
No stereo < 8-membered rings (default)
Divide structure information into ‘layers’
Ignore Electron Density
Not required for compound identification
Represent ‘excited states’
Simplify representations
Delocalization, aromaticity, zwitterions, coordination …
Assume Free Rotation Around Single Bonds
Four Basic ‘Layers’
Formula
Connectivity
Stereochemistry
Isotopic ‘Corrections’
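A minimal sketch of how layers could support selectable specificity, assuming each layer is already available as a string. The layer strings, the `/` separator, and the function name `build_identifier` are hypothetical and do not reproduce the actual IChI syntax.

```python
LAYER_ORDER = ("formula", "connectivity", "stereochemistry", "isotopic")

def build_identifier(layers, up_to="isotopic"):
    """Join layers in a fixed order, stopping after `up_to`; empty layers are skipped."""
    if up_to not in LAYER_ORDER:
        raise ValueError(f"unknown layer: {up_to!r}")
    stop = LAYER_ORDER.index(up_to) + 1
    parts = [layers[name] for name in LAYER_ORDER[:stop] if layers.get(name)]
    return "/".join(parts)

# Placeholder layer contents, for illustration only.
example_layers = {
    "formula": "C2H6O",
    "connectivity": "1-2,2-3",
    "stereochemistry": "",
    "isotopic": "i1:13C",
}

coarse = build_identifier(example_layers, up_to="connectivity")
full = build_identifier(example_layers)
print(coarse)
print(full)
```

With a fixed layer order, the less specific identifier is a prefix of the more specific one, so records can be matched at whatever level of detail the data supports.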
Connectivity Sublayers
Disconnect metals and H-atoms
Reconnect metals
Reconnect H-atoms
Non-mobile (non-tautomers)
Mobile (distinguish tautomers)
Basic Tautomer Layer
Tautomers
Stereochemical Sublayers
sp2 – double bond
sp3 – tetrahedral
{others added later}
Canonicalize:
Identify Stereogenic Centers
Byproduct of IChI Creation
Assist chemists for structure confirmation
Nitrobenzene
MSG tautomeric
MSG fixed
Ferrocene
Auxiliary Output
Byproduct
Label stereogenic atoms
Identify equivalent atoms
Warnings/Errors
Unusual valences
Unrecognized input
‘Reversibility’
Coordinates
Bond/Charge Location
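The warning output can be illustrated with a small valence check. This is a sketch under stated assumptions: the table of typical valences is a simplified, hypothetical subset, hydrogens must be given explicitly, and the message wording is invented for the example.

```python
# Simplified, illustrative table; a real tool would cover charges and many more elements.
TYPICAL_VALENCES = {"H": {1}, "C": {4}, "N": {3, 5}, "O": {2}}

def valence_warnings(atoms, bonds):
    """Return warnings for atoms whose summed bond order is not a typical valence."""
    totals = [0] * len(atoms)
    for a, b, order in bonds:
        totals[a] += order
        totals[b] += order
    warnings = []
    for index, (symbol, total) in enumerate(zip(atoms, totals)):
        allowed = TYPICAL_VALENCES.get(symbol)
        if allowed is None:
            warnings.append(f"atom {index + 1}: unrecognized element {symbol!r}")
        elif total not in allowed:
            warnings.append(f"atom {index + 1}: unusual valence {total} for {symbol}")
    return warnings

methane = (["C", "H", "H", "H", "H"], [(0, i, 1) for i in range(1, 5)])
overbonded = (["C", "H", "H", "H", "H", "H"], [(0, i, 1) for i in range(1, 6)])
print(valence_warnings(*methane))
print(valence_warnings(*overbonded))
```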
Gold Book - Present
Each term on a PDF page
Looks like printed page
Links to other definitions
For display only
Not easily convertible
Graphics, symbols, equations
Text not ‘parsed’
No ‘metadata’
Gold Book - Translation
Text
Perceive and tag data types and relationships
Simple Structures
To connection tables/CML/SVG
Equations
To MathML
Figures & Complex Schemes
Redraw in SVG
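As an illustration of tagging text and relationships, the sketch below builds one dictionary entry as XML with Python's standard `xml.etree.ElementTree`. The element and attribute names (`entry`, `term`, `definition`, `seeAlso`, `ref`) and the identifiers are hypothetical, not the project's schema, and the definition text is a placeholder.

```python
import xml.etree.ElementTree as ET

def tag_entry(term_id, term, definition, related_ids):
    """Build one tagged entry with explicit links to related entries."""
    entry = ET.Element("entry", attrib={"id": term_id})
    ET.SubElement(entry, "term").text = term
    ET.SubElement(entry, "definition").text = definition
    for related_id in related_ids:
        ET.SubElement(entry, "seeAlso", attrib={"ref": related_id})
    return entry

entry = tag_entry(
    term_id="term-001",
    term="tautomerization",
    definition="(placeholder definition text)",
    related_ids=["term-002", "term-003"],
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Once links are stored as attributes rather than page layout, other software can follow them without parsing displayed text.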
Gold Book - Connections
Use by Other MLs
Schema
Integrate dictionary via STMML
Internal Maintenance
Develop ‘authoring’ procedures
Gold Book - Impact
Provide uniform chemical terminology for XML documents
Traceability of terminology
Root for the ‘tagging’ of chemistry
Model for future IUPAC recommendations
Green Book - Promise
‘Template’ for numeric property validation in chemistry
Ensure proper units and representation
Traceability to IUPAC definition
Basic Standards for Numeric Data ‘Tagging’
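A template for numeric property validation might look like the sketch below. The property names, allowed units, and minimum values in `PROPERTY_TEMPLATES` are assumptions chosen for illustration and are not taken from the Green Book.

```python
import math

# Hypothetical templates: each property lists its accepted units and a lower bound.
PROPERTY_TEMPLATES = {
    "molar_mass": {"units": {"g/mol", "kg/mol"}, "minimum": 0.0},
    "thermodynamic_temperature": {"units": {"K"}, "minimum": 0.0},
}

def validate(name, value, unit):
    """Return a list of problems for one tagged value; an empty list means it passed."""
    template = PROPERTY_TEMPLATES.get(name)
    if template is None:
        return [f"unknown property {name!r}"]
    problems = []
    if unit not in template["units"]:
        problems.append(f"unit {unit!r} not allowed for {name}")
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not is_number or not math.isfinite(value):
        problems.append("value is not a finite number")
    elif value < template["minimum"]:
        problems.append(f"value {value} below minimum {template['minimum']}")
    return problems

print(validate("molar_mass", 18.015, "g/mol"))
print(validate("molar_mass", 18.015, "amu"))
```

Keying each check to a named property is what would let a reported value be traced back to the definition it claims to follow.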
Green Book - Promise
Periodic Table and Relative Molar Masses
‘Official’ digital source
Connect to relevant IUPAC information
Root of chemical information ‘tree’
Spectroscopy, electrochemistry, thermochemistry, catalysis, …
Next
Nov 12-14 Meeting at NIST
IChI
Final Beta Nov. 2002
Dissemination
Version 2
XML Data Dictionary
Finish Gold Book Conversion
Maintenance Path
Begin Green Book