1
What the IUPAC/NIST Chemical Identifier (InChI) Means to You
  • Stephen Heller

  • Physical & Chemical Properties Division
  • NIST
  • Gaithersburg, MD 20899-8380


  • srheller@nist.gov


2
The slides from this talk can be found at:
http://www.hellers.com/steve/pub-talks/fda-4-06/frame.htm
3
4 Useful InChI URLs

IUPAC InChI URL:  http://www.iupac.org/projects/2004/2004-039-1-800.html


The InChI-L Listserver WebBoard URL: http://webboard.rsc.org:8088/~INCHI-L


InChI FAQs, created by Nick Day, Cambridge University, UK:
http://wwmm.ch.cam.ac.uk/inchifaq/


IUPAC Prague Group InChI URL:
www.inchi.info
4
5 Critical Factors Affecting Chemical Information in the 21st Century

1. Internet - WWW
2. Internet - WWW
3. Internet - WWW
4. Internet - WWW
5. Internet - WWW
5
The Internet has created ONE marketplace. There are no longer mini-marketplaces. The way to make things work is to link these mini-markets.
6
The Internet has caused major changes for scientists around the world. The Open Access (Beilstein Journal of Organic Chemistry), Open Source (IUPAC - InChI), and Open Data (NIH Roadmap - PubChem) projects are a result of this new world. These projects are growing, and they are changing both the way scientists do business and the businesses and organizations that scientists use.
7
Those who only remember the past are condemned to misread the future.

F. Zakaria, Newsweek, 8/15/05
8
InChI

A project whose time has come. The Internet, an international scientific body (IUPAC), and international cooperation (US, UK, Czech Republic) have led to the speedy development, implementation, and use of InChI.

While InChI is a public domain system for creating a unique computer-readable identifier (“name”), it is NOT a registry system. InChIs are created only by those who choose to adopt and use the algorithm. Registry systems which index the literature are complementary to any InChI databases that anyone creates.
9
Unique identifier for chemical structures

If we are to be able to find chemistry on the web then we need a primary key/unique identifier to locate the data.

InChI = a revolutionary new approach to indexing full chemical structure:

1. Not dependent on specialised search software
2. Compatible with XML, HTML, database fields, etc.
3. Robust when deployed on the Web
4. Open and International (IUPAC)

(Slide from Nick Day & Peter Murray-Rust/Cambridge)
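
As a minimal sketch of the "primary key" idea, the Python below stores compounds in an in-memory SQLite table keyed by the InChI string. The table name, column names, and record sources are assumptions made for illustration only; the InChI strings are those of ethanol, methane, and water in the current standard InChI format.

```python
import sqlite3

# Hypothetical schema: the InChI string itself is the primary key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compound (inchi TEXT PRIMARY KEY, name TEXT, source TEXT)"
)

records = [
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "ethanol", "lab notebook"),
    ("InChI=1S/CH4/h1H4", "methane", "lab notebook"),
    # Same structure under a different trivial name, from another source.
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "ethyl alcohol", "vendor catalogue"),
]
for inchi, name, source in records:
    # INSERT OR IGNORE keeps the first record seen for each InChI,
    # so the duplicate structure does not create a second row.
    conn.execute(
        "INSERT OR IGNORE INTO compound VALUES (?, ?, ?)", (inchi, name, source)
    )

def find_by_inchi(inchi):
    """Exact-match lookup; no structure search software is involved."""
    return conn.execute(
        "SELECT name, source FROM compound WHERE inchi = ?", (inchi,)
    ).fetchone()

row_count = conn.execute("SELECT COUNT(*) FROM compound").fetchone()[0]
print(row_count)
print(find_by_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
print(find_by_inchi("InChI=1S/H2O/h1H2"))
```

Because the identifier is plain text, the lookup is an ordinary string comparison that any database or text index can perform.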
10
How does InChI differ from SMILES?

Like InChI, the SMILES language allows a canonical serialization of molecular structure. However, SMILES is proprietary and unlike InChI is not an open project. This has led to the use of different generation algorithms, and thus, different SMILES versions of the same compound have been found.  In fact, we have found seven different unique SMILES for caffeine on Web sites:

1. [c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O]
2. CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 
3. Cn1cnc2n(C)c(=O)n(C)c(=O)c12
4. Cn1cnc2c1c(=O)n(C)c(=O)n2C
5. N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N2
6. O=C1C2=C(N=CN2C)N(C(=O)N1C)C
7. CN1C=NC2=C1C(=O)N(C)C(=O)N2C

(From InChI FAQs)
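
The consequence for text-based lookup can be sketched in a few lines of Python. The seven strings are copied from the list above; the helper `heavy_atom_counts` is a hypothetical, deliberately crude check that only counts C, N and O symbols, which is adequate here because none of these strings contains a two-letter element such as Cl.

```python
from collections import Counter

# The seven caffeine SMILES quoted above, copied verbatim.
caffeine_smiles = [
    "[c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O]",
    "CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12",
    "Cn1cnc2n(C)c(=O)n(C)c(=O)c12",
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N2",
    "O=C1C2=C(N=CN2C)N(C(=O)N1C)C",
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
]

def heavy_atom_counts(smiles):
    """Crude count of carbon, nitrogen and oxygen symbols in a SMILES string."""
    return Counter(ch for ch in smiles.upper() if ch in "CNO")

# As text keys, the seven strings are all different ...
distinct_strings = set(caffeine_smiles)

# ... although each one describes the same C8 N4 O2 heavy-atom skeleton.
compositions = {
    tuple(sorted(heavy_atom_counts(s).items())) for s in caffeine_smiles
}

print(len(distinct_strings))
print(compositions)
```

An exact-match search on any one of these strings would miss pages that use one of the other six, which is the problem a single canonical identifier is meant to remove.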
11
 
12
InChI uptake by software developers and database providers
Software:
1. Structure Drawing
  a. ACD Labs: ChemSketch http://www.acdlabs.com
  b. CambridgeSoft: ChemDraw http://www.camsoft.com
  c. ChemAxon: Marvin http://www.chemaxon.com
  d. BK-Chem: http://bkchem.zirael.org/inchi_en.html
2. Structure Search
  a. IBM (internal project)
3. Analysis software
  a. SciTegic: http://www.scitegic.com
4. Structure file interconversion
  a. OpenBabel: http://openbabel.sourceforge.net/RELEASE.shtml
5. Other software
  a. World Wide Molecular Matrix: http://wwmm.ch.cam.ac.uk/gridsphere/gridsphere 

Databases:
1. NIST WebBook http://webbook.nist.gov
2. NIH PubChem http://pubchem.ncbi.nlm.nih.gov
3. NCI DTP http://cactus.nci.nih.gov/ncidb2/
4. EPA - DSSTox http://www.epa.gov/nheerl/dsstox/
5. UC-SF ZINC project http://blaster.docking.org/zinc/
6. KEGG http://www.genome.ad.jp/kegg/
7. ISI Web of Science http://portal.isiknowledge.com/
8. Carcinogenic Potency http://potency.berkeley.edu/structure.html
9. ChEBI http://www.ebi.ac.uk/chebi
 
Information resource:
P. Murray-Rust/N. Day: http://wwmm.ch.cam.ac.uk/inchifaq/
13
Early InChI Adopters

NIST WebBook & Mass Spec – 150,000
NIH/NCBI/PubChem project – 5.3 million+
IBM Patents – 1.6+ million
ISI Web of Science – 2+ million
NCI Database – 20 million+
EPA DSSTox database – 1,450
KEGG database – 9,584
UCSF ZINC – 3.3 million
Prous Science – 300,000
John Wiley – Mass Spec – 600,000
14
 
15
Web 1.0 vs. Web 2.0

In Web 1.0, a small number of writers created Web pages for a large number of readers. As a result, people could get information by going directly to the source: Adobe.com for graphic design issues, Microsoft.com for Windows issues, and CNN.com for news. Over time, however, more and more people started writing content in addition to reading it. This had an interesting effect—suddenly there was too much information to keep up with! We did not have enough time for everyone who wanted our attention and visiting all sites with relevant content simply wasn’t possible. As personal publishing caught on and went mainstream, it became apparent that the Web 1.0 paradigm had to change.



Enter Web 2.0, a vision of the Web in which information is broken up into “microcontent” units that can be distributed over dozens of domains. The Web of documents has morphed into a Web of data. We are no longer just looking to the same old sources for information. Now we’re looking to a new set of tools to aggregate and remix microcontent in new and useful ways.
16
Web 1.0 vs. Web 2.0

Britannica Online → Wikipedia
Personal websites → Blogging
Publishing → Participation
Commonly-Applied Substance-Registration-Number → InChI

http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html
17
Digital ‘Naming’ of Chemicals

Chemical structure is the true ‘identifier’

But, structure representations are not unique or convenient for computers.

So, convert structure to a unique ‘name’ by fixed algorithms: the IUPAC International Chemical Identifier (InChI).
18
Two Problems

Chemicals
1. Fast isomerization (tautomerization)
2. Ill-defined connectivity

Chemists
1. Differing conventions
   Depends on discipline, education and convenience
2. Imprecision/uncertainty
19
3 Steps to InChI

Chemistry –
‘Normalize’ Input Structure
  Implement chemical rules

Math –
‘Canonicalize’ (label the atoms)
  Equivalent atoms get the same label

Format –
‘Serialize’ Labeled Structure
  Output as character string (‘name’)
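
The three steps can be illustrated with a toy Python sketch. This is not the InChI algorithm: the normalization rule (dropping explicit hydrogens), the brute-force canonicalization, the output format, and all function names are assumptions chosen to keep the example short. It shows only that the output string does not depend on the order in which the atoms were drawn; it does not assign shared labels to equivalent atoms.

```python
from itertools import permutations

def normalize(atoms, bonds):
    """Step 1 (toy rule): keep heavy atoms only, dropping explicit hydrogens."""
    keep = [i for i, symbol in enumerate(atoms) if symbol != "H"]
    new_index = {old: new for new, old in enumerate(keep)}
    heavy_atoms = [atoms[i] for i in keep]
    heavy_bonds = [
        (new_index[i], new_index[j])
        for i, j in bonds
        if i in new_index and j in new_index
    ]
    return heavy_atoms, heavy_bonds

def canonicalize(atoms, bonds):
    """Step 2 (toy): try every atom ordering, keep the smallest description.

    The cost grows factorially, so this is usable only for tiny molecules.
    """
    best = None
    for order in permutations(range(len(atoms))):
        new_index = {old: new for new, old in enumerate(order)}
        labels = [atoms[old] for old in order]
        edges = sorted(
            tuple(sorted((new_index[i], new_index[j]))) for i, j in bonds
        )
        if best is None or (labels, edges) < best:
            best = (labels, edges)
    return best

def serialize(labels, edges):
    """Step 3 (toy): write the labeled structure as one character string."""
    bond_part = ",".join(f"{a + 1}-{b + 1}" for a, b in edges)
    return "".join(labels) + "/" + bond_part

def toy_identifier(atoms, bonds):
    return serialize(*canonicalize(*normalize(atoms, bonds)))

# Ethanol drawn two ways (different atom order, one with an explicit H).
ethanol_a = toy_identifier(["C", "C", "O", "H"], [(0, 1), (1, 2), (2, 3)])
ethanol_b = toy_identifier(["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether has the same atoms but different connectivity.
dimethyl_ether = toy_identifier(["C", "O", "C"], [(0, 1), (1, 2)])

print(ethanol_a, ethanol_b, dimethyl_ether)
```

Both ethanol inputs give the same string, while dimethyl ether gives a different one.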
20
InChI is a name or a string, and often a long string. It is not an ID or “registry number”, and developing such an ID is not part of the IUPAC InChI project. However, other organizations are working on a unique ID that would be associated with a unique InChI. If and when that work is completed, it will be made available to the chemical community.
21
Conclusion

InChI means that you can create your own structure files and databases using a public domain, open source system. Using InChI means you can freely exchange structure files with others within your organization, and with any person or organization anywhere in the world, knowing the structure name (the InChI) will be the same. You can search the Internet using an InChI knowing you will find a match if it is there, without needing to worry whether it was coded differently by another person or program. InChI means you are no longer dependent on a proprietary system.
22
"TO USE InChI"
23
Acknowledgements

Steve Bachrach, Evan Bolton, Steve Bryant, Denise Creech, Nick Day, Rene Deplanque, Guenter Grethe, Stevan Hanard, Martin Hicks, Sami Kassab, Beda Kosata, Gary Mallard, Randy Marcinko, Alan McNaught, Bill Milne, Miloslav Nic, Carmen Nitsche, Josep Prous, Chris Reed, Rich Roberts, Peter Murray-Rust, Henry Rzepa, Steve Stein, Peter Shepherd, Bill Town, Andrea Twiss-Brooks, Wendy Warr, and Ann Wolpert.