|
|
|
Steve Stein, Steve Heller,
Dmitrii Tchekhovskoi |
|
|
|
National Institute of Standards and Technology |
|
Gaithersburg, MD, USA |
|
CAS/IUPAC Conference on Chemical Identifiers and XML for Chemistry
Columbus, OH
July 1, 2002 |
|
|
|
|
|
|
Mission |
|
International, open standards for chemical
communication |
|
|
|
Printed Media – Nomenclature |
|
Human communication |
|
Rules for structure to name conversion |
|
|
|
Digital Media – Identifier |
|
Computer communication |
|
Rules for structure to identifier conversion |
|
Freed from restrictions of ‘pronounceability’ |
|
Freed from ring index |
|
|
|
|
Structures |
|
Connection Tables |
|
‘Trivial’ Names |
|
Systematic Names |
|
Index Numbers |
|
|
|
|
|
Structure diagrams |
|
various conventions |
|
contain ‘too much’ information |
|
Connection Tables |
|
MolFiles, SMILES, ROSDAL, … |
|
Pronounceable names |
|
IUPAC, CAS, trivial |
|
Index Numbers |
|
EINECS, FEMA, DOT, RTECS, CAS, Beilstein, USP,
EEC, RCRA, NCI, UN, USAF |
|
|
|
|
|
|
|
Exactly one Identifier per structure |
|
|
|
Defined by algorithms |
|
|
|
Comprehensive |
|
|
|
Openly available |
|
|
|
Implemented |
|
|
|
|
|
|
|
Different compounds have different identifiers |
|
All distinguishing structural information is
included |
|
|
|
|
|
One compound has only one identifier |
|
No unnecessary information is included |
|
|
|
|
|
Discrete, covalently bonded compounds |
|
foundation for other classes |
|
Isotopes |
|
Stereochemistry |
|
sp3 - tetrahedral |
|
Z/E - double bond |
|
Tautomers |
|
|
|
|
|
‘Normalize’ Input Structure |
|
Implement chemical rules |
|
|
|
‘Canonicalize’ (label the atoms) |
|
Equivalent atoms get the same label |
|
|
|
‘Serialize’ the Labeled Structure |
|
A unique series of bytes |
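
The canonicalize and serialize steps can be illustrated with a deliberately naive sketch. It is not the IChI algorithm: it tries every relabeling of a small structure and keeps the lexicographically smallest result, which is only practical for a handful of atoms. The function name `canonical_serialization` and the output format are assumptions made for this example, and the sketch gives every atom a unique label rather than computing equivalence classes.

```python
from itertools import permutations


def canonical_serialization(elements, bonds):
    """Return one string per structure, independent of input numbering.

    elements: list of element symbols, indexed by input atom number (0-based)
    bonds:    iterable of (i, j) pairs of input atom numbers
    Brute force over all relabelings; a toy illustration only.
    """
    n = len(elements)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new label assigned to input atom `old`.
        new_elements = [None] * n
        for old, new in enumerate(perm):
            new_elements[new] = elements[old]
        new_bonds = sorted(tuple(sorted((perm[i], perm[j]))) for i, j in bonds)
        candidate = (new_elements, new_bonds)
        # Keep the lexicographically smallest labeled structure.
        if best is None or candidate < best:
            best = candidate
    atoms, edges = best
    # Serialize: element symbols in label order, then 1-based bond pairs.
    atom_part = "".join(atoms)
    bond_part = ",".join(f"{i + 1}-{j + 1}" for i, j in edges)
    return f"{atom_part};{bond_part}"


# Ethanol heavy atoms entered with two different numberings, and dimethyl ether.
ethanol_a = canonical_serialization(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_serialization(["O", "C", "C"], [(0, 1), (1, 2)])
ether = canonical_serialization(["C", "O", "C"], [(0, 1), (1, 2)])
```

Here `ethanol_a` and `ethanol_b` come out identical while `ether` differs, which is the behavior the two uniqueness requirements ask for.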
|
|
|
|
|
|
|
Ignore ‘Electron Density’ |
|
Double/triple bonds, Odd-electrons, Charges |
|
Still used for Z/E stereo perception |
|
Free Rotation Around Single Bonds |
|
Divide IChI into Layers |
|
|
|
|
|
Not required for compound identification |
|
Distinguishes ‘excited states’ |
|
|
|
Avoids problems |
|
Delocalization, aromaticity, zwitterions, … |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Separate ‘Name’ into Fragments by |
|
Connectivity |
|
Isotopes |
|
Stereochemistry |
|
Tautomerism |
|
|
|
|
|
Just atoms and their neighbors |
|
Ignore everything else |
|
|
|
Robust basic identifier |
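
As a rough illustration of "just atoms and their neighbors", the sketch below drops bond orders and charges from an input structure and keeps only the element list and neighbor lists. The input layout and the function name `connectivity` are assumptions for this example, not part of IChI.

```python
def connectivity(atoms, bonds):
    """Reduce a structure to elements and neighbor lists.

    atoms: list of (element, charge) tuples, indexed by atom number (0-based)
    bonds: list of (i, j, order) tuples
    Bond orders and charges are deliberately discarded.
    """
    elements = [element for element, _charge in atoms]
    neighbors = {i: set() for i in range(len(atoms))}
    for i, j, _order in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)
    return elements, {i: sorted(ns) for i, ns in neighbors.items()}


# The two Kekule drawings of the benzene ring (carbons only).
ring = [("C", 0)] * 6
kekule_a = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1)]
kekule_b = [(0, 1, 1), (1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 0, 2)]
same = connectivity(ring, kekule_a) == connectivity(ring, kekule_b)
```

Both drawings reduce to the same connectivity, so a layer built on this information is unaffected by how double bonds were placed in the input.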
|
|
|
|
|
|
|
Double Bond (Z/E) |
|
from coordinates or bonding |
|
|
|
Tetrahedral (sp3) |
|
‘in/out’ bonds or x,y,z coordinates |
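
Perceiving double-bond geometry from coordinates can be sketched as a side-of-line test. This is a minimal example under stated assumptions: 2D coordinates, and the two reference substituents have already been chosen by some priority rule that is not shown. The function name `double_bond_geometry` is hypothetical.

```python
def double_bond_geometry(a, b, sub_a, sub_b):
    """Classify the double bond a=b from 2D coordinates.

    a, b:   (x, y) positions of the two double-bonded atoms
    sub_a:  (x, y) of the reference substituent on atom a
    sub_b:  (x, y) of the reference substituent on atom b
    Returns "Z" if both substituents lie on the same side of the bond axis,
    "E" if on opposite sides, and None if either is collinear with the bond.
    """
    def side(p):
        # z-component of the cross product (b - a) x (p - a)
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

    side_a, side_b = side(sub_a), side(sub_b)
    if side_a == 0 or side_b == 0:
        return None  # geometry is undefined in this drawing
    return "Z" if (side_a > 0) == (side_b > 0) else "E"


cis = double_bond_geometry((0, 0), (1, 0), (-0.5, 0.87), (1.5, 0.87))
trans = double_bond_geometry((0, 0), (1, 0), (-0.5, 0.87), (1.5, -0.87))
```

A linear drawing of the substituent returns `None`, which a caller could report as a warning instead of guessing.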
|
|
|
|
|
|
|
|
Speed up processing |
|
Helpful for chemists |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Basic ## |
|
Isotopic ## |
|
Stereo ## |
|
Stereo ## |
|
|
|
Tautomeric ## |
|
Isotopic ## |
|
Stereo ## |
|
Stereo ## |
|
Electronic ## |
|
|
|
|
Example: Benzene |
|
Represent each atom by its sequence number in the formula |
|
|
|
C6H6 = C C C C C C H H H H H H |
|
tags: 1 2 3 4 5 6 7 8 9 10 11 12 |
|
Basic Layer: |
|
<basic>C6H6 1-2-7 2-3-8 3-4-9 4-5-10 5-6-11 7-12</basic> |
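
The tagging scheme in the benzene example, where each atom is numbered by its position in the expanded formula, can be sketched as follows. The sketch assumes a simple formula with no parentheses and element symbols already in the desired order. The function name `atom_tags` is hypothetical.

```python
import re


def atom_tags(formula):
    """Expand a simple formula into (tag, element) pairs, tags starting at 1.

    Handles only flat formulas such as "C6H6"; no parentheses or charges.
    """
    tags = []
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means a single atom of that element.
        for _ in range(int(count) if count else 1):
            tags.append((len(tags) + 1, symbol))
    return tags


benzene = atom_tags("C6H6")
```

For `"C6H6"` the carbons receive tags 1 to 6 and the hydrogens 7 to 12, matching the numbering used in the basic layer above.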
|
|
|
|
|
|
Information Only |
|
For user verification |
|
Label true stereogenic atoms |
|
Identify equivalent atoms |
|
Warnings |
|
Unusual valences |
|
Unrecognized input |
|
|
|
‘Reversibility’ Information |
|
Coordinates |
|
Electron density |
|
Positions of double/triple bonds, charges, odd
electrons |
|
|
|
|
|
|
|
|
|
|
|
|
|
Chemists |
|
Different ways to represent the same thing |
|
Different definitions of tautomerism |
|
Different guesses |
|
|
|
Chemicals |
|
Structures can depend on conditions |
|
Tautomers can depend on conditions |
|
|
|
|
|
|
|
|
|
|
|
Coordinates |
|
Structure display |
|
Original bonds and charges |
|
For display and future use |
|
Original numbering |
|
Map to input data |
|
|
|
|
|
Discover that two structures with different
connectivity represent the same compound |
|
Unless they are tautomers |
|
Predict potential for Z/E isomerism in open
shell conjugated networks |
|
Cannot predict rotational barriers |
|
Fix improperly entered data |
|
Guarantees wrong IChI for bad data |
|
Properly treat non-covalent bonding |
|
Coordinate bonds |
|
Represent ‘exotic’ stereochemistry |
|
|
|
|
|
|
|
Implement All Normalization Rules – 12/02 |
|
Test against available data sets – 3/03 |
|
Final External Testing and Refinement – 7/03 |
|
Documentation, source, executable – 12/03? |
|
|
|
Open discussions |
|
ichi-l@list.rsc.org |
|
|
|
|
|
Organometallics |
|
Coordinate bonds |
|
Other Stereo Forms |
|
Non-atom centered |
|
Conformations |
|
Hydrogen Bonding |
|
|
|
Polymers/Macromolecules |
|
Compound Classes |
|
Markush structures |
|