Steve Stein, Steve Heller, Dmitrii Tchekhovskoi |
|
|
|
National Institute of Standards and Technology |
|
Gaithersburg, MD, USA |
|
U.S. Government Chemical Databases |
|
Frederick, MD
July 22, 2003 |
|
|
|
|
|
Structure diagrams |
|
various conventions |
|
contain ‘too much’ information |
|
Connection Tables |
|
MolFiles, SMILES, ROSDAL, ... |
|
Pronounceable names |
|
IUPAC, CAS, trivial |
|
Index Numbers |
|
EINECS, FEMA, DOT, RTECS, CAS, Beilstein, USP, EEC, RCRA, NCI, UN, USAF |
|
|
|
|
|
|
Derived from structure by algorithm |
|
|
|
Accepts common drawing conventions |
|
|
|
Exactly one Identifier per structure |
|
|
|
Comprehensive |
|
|
|
Openly available |
|
|
|
|
|
Different compounds have different identifiers |
|
All distinguishing structural information is included |
|
|
|
|
|
One compound has only one identifier |
|
Include only necessary information |
|
|
|
|
|
Discrete, bonded compounds |
|
Include ‘dot disconnected’ compounds |
|
|
|
Stereochemistry |
|
sp3 - tetrahedral |
|
Z/E - double bond |
|
Tautomers |
|
|
|
|
|
‘Normalize’ Input Structure |
|
Defined input structure required |
|
Remove conventions with chemical rules |
|
Divide into ‘layers’ |
|
|
|
‘Canonicalize’ (label the atoms) |
|
Equivalent atoms get the same label |
|
|
|
‘Serialize’ the Labeled Structure |
|
A unique series of bytes |
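
The three steps above can be sketched on a toy structure. This is only an illustration under stated assumptions, not the identifier's actual algorithm: the function names, the (atoms, bonds) input format, and the output string format are all hypothetical, and the canonical labeling here is a brute-force search over every atom ordering, which is practical only for very small structures. It also does not report which atoms are equivalent.

```python
from itertools import permutations

def normalize(atoms, bonds):
    """Drop bond orders and duplicate bonds; keep only elements and neighbors."""
    edges = {frozenset((a, b)) for a, b, _order in bonds}
    return list(atoms), edges

def canonicalize(atoms, edges):
    """Try every atom ordering and keep the lexicographically smallest description."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new label given to the atom at input position `old`
        symbols = [None] * n
        for old, new in enumerate(perm):
            symbols[new] = atoms[old]
        relabeled = sorted(
            tuple(sorted((perm[a], perm[b]))) for a, b in (tuple(e) for e in edges)
        )
        candidate = (tuple(symbols), tuple(relabeled))
        if best is None or candidate < best:
            best = candidate
    return best

def serialize(canonical):
    """Turn the labeled structure into one series of bytes (labels start at 1)."""
    symbols, edges = canonical
    text = "".join(symbols) + " " + " ".join(f"{a + 1}-{b + 1}" for a, b in edges)
    return text.strip().encode("ascii")

def identifier(atoms, bonds):
    return serialize(canonicalize(*normalize(atoms, bonds)))

# Hydrogen-suppressed ethanol drawn with two different atom orders,
# and dimethyl ether, which has the same formula but different connectivity.
ethanol_a = identifier(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_b = identifier(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
ether = identifier(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])
print(ethanol_a == ethanol_b, ethanol_a != ether)
```

The two ethanol drawings give the same bytes because the search is over all orderings, so the input order cannot influence the result; the ether differs because no relabeling makes its bond list match.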
|
|
|
|
|
|
|
Ignore ‘Electron Density’ |
|
Double/Triple/Coordination bonds |
|
Odd-electrons/Charges |
|
|
|
Free Rotation Around Single Bonds |
|
Separate structure information into ‘layers’ |
|
|
|
|
|
Not required for compound identification |
|
Represent ‘excited states’ |
|
|
|
Simplify representations |
|
Delocalization, aromaticity, zwitterions, coordination … |
|
|
|
|
|
Formula |
|
Connectivity |
|
Disconnect metals |
|
Connect metals |
|
Isotopes |
|
Stereochemistry |
|
Tautomers (on/off) |
|
|
|
|
|
Just atoms and their neighbors |
|
Ignore everything else |
|
|
|
Non-Metals: |
|
CHNO |
|
SPSi |
|
Halogens |
|
|
|
|
|
All bonds to metals |
|
Ignore bond type |
|
|
|
Empty if no bonds to metals |
|
|
|
Use when bonding is known and significant |
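
One way to picture the split between the non-metal connectivity and the metal layer is the sketch below. It assumes a hypothetical input of element symbols plus (atom, atom, bond order) triples, and it treats every element outside the non-metal list on the slide (C, H, N, O, S, P, Si, and the halogens, taken here as F, Cl, Br, I) as a metal. That classification is an assumption made for illustration only.

```python
# Non-metals as listed on the slide; the halogen set is an assumption.
NON_METALS = {"C", "H", "N", "O", "S", "P", "Si", "F", "Cl", "Br", "I"}

def split_layers(atoms, bonds):
    """Return (non_metal_bonds, metal_bonds) with bond types ignored.

    atoms: list of element symbols, indexed from 0
    bonds: list of (atom_index, atom_index, bond_order) triples
    """
    basic, metal = [], []
    for a, b, _order in bonds:  # the bond order is deliberately discarded
        pair = (min(a, b), max(a, b))
        if atoms[a] in NON_METALS and atoms[b] in NON_METALS:
            basic.append(pair)
        else:
            metal.append(pair)  # any bond touching a metal goes to the metal layer
    return sorted(basic), sorted(metal)

# Acetate drawn with an explicit O-Na bond (heavy atoms only).
atoms = ["C", "C", "O", "O", "Na"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1)]
basic, metal = split_layers(atoms, bonds)
print(basic)  # bonds among non-metals
print(metal)  # bonds to the metal; an empty list when there are none
```

Keeping the metal bonds in their own list means the non-metal part is the same whether or not the drawer chose to connect the metal.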
|
|
|
|
|
|
|
Double Bond (Z/E) |
|
Coordinates |
|
Defined parity |
|
|
|
Tetrahedral (sp3) |
|
‘in/out’ bonds |
|
x,y,z coordinates |
|
Defined parity |
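
As a rough illustration of how coordinates can yield a defined parity, the sketch below computes two geometric signs: which side of a double bond a substituent lies on (2D coordinates), and the handedness of four neighbors around a tetrahedral center (x,y,z coordinates). The function names are hypothetical, and the signs are purely geometric; turning them into Z/E labels or into the identifier's parity values would additionally require a ranking of the substituents, which is not shown here.

```python
def side(p, q, r):
    """Sign of the 2D cross product (q - p) x (r - p).

    +1 if r lies to the left of the direction p -> q, -1 if to the right,
    0 if the three points are collinear.
    """
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)

def double_bond_geometry(a, b, sub_a, sub_b):
    """Compare one substituent on each end of the double bond a=b."""
    sa, sb = side(a, b, sub_a), side(a, b, sub_b)
    if sa == 0 or sb == 0:
        return "undefined"  # a collinear drawing carries no stereo information
    return "same side" if sa == sb else "opposite sides"

def tetrahedral_sign(n1, n2, n3, n4):
    """Sign of the signed volume spanned by four neighbors, taken in this order."""
    u = [n2[i] - n1[i] for i in range(3)]
    v = [n3[i] - n1[i] for i in range(3)]
    w = [n4[i] - n1[i] for i in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return (det > 0) - (det < 0)

print(double_bond_geometry((0, 0), (1, 0), (-0.5, 0.8), (1.5, 0.8)))
print(tetrahedral_sign((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)))
```

Swapping any two neighbors in `tetrahedral_sign` flips the sign, which is why a fixed atom ordering is needed before a parity becomes meaningful.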
|
|
|
|
|
|
|
|
Speed up processing |
|
Helpful for chemists |
|
|
|
|
|
Basic – non-metal ## |
|
Metal ## |
|
Isotopic ## |
|
Stereo ## |
|
Stereo ## |
|
|
|
Tautomer – non-metal ## |
|
Metal ## |
|
Isotopic ## |
|
Stereo ## |
|
Stereo ## |
|
Electronic ## |
|
|
|
|
Example: Benzene |
|
Represent atoms as sequence number in formula |
|
|
|
C6H6 = C C C C C C H H H H H H |
|
tags 1 2 3 4 5 6 7 8 9 10 11 12 |
|
Basic Layer: |
|
<basic>C6H6 1-2-7 2-3-8 3-4-9 4-5-10 5-6-11 7-12</basic> |
|
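
The "sequence number in formula" idea from the benzene example can be sketched as follows. This assumes a simple formula made only of element symbols followed by optional counts (no parentheses, charges, or isotopes); the function name `atom_tags` is hypothetical.

```python
import re

def atom_tags(formula):
    """Expand a simple formula such as 'C6H6' into {tag: element}, tags starting at 1."""
    tags = {}
    next_tag = 1
    # Each match is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        for _ in range(int(count) if count else 1):
            tags[next_tag] = symbol
            next_tag += 1
    return tags

tags = atom_tags("C6H6")
print(tags)  # tags 1-6 are carbons, tags 7-12 are hydrogens
```

Once every atom has a tag, a connection list only needs to mention the numbers, since the element behind each number follows from the formula.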
|
|
|
|
Information Only |
|
For user verification |
|
Label true stereogenic atoms |
|
Identify equivalent atoms |
|
Warnings |
|
Unusual valences |
|
Unrecognized input |
|
|
|
‘Reversibility’ Information |
|
Coordinates |
|
Electron density |
|
Positions of double/triple bonds, charges, odd electrons |
|
|
|
|
|
Other Stereo Forms |
|
Non-atom centered |
|
Conformations |
|
Hydrogen Bonding |
|
|
|
Polymers/Macromolecules |
|
Compound Classes |
|
Markush structures |
|