Open Source/Open Access/Open Data and the IUPAC International Chemical Identifier (InChI)
Stephen Heller, Stephen Stein, & Dmitrii Tchekhovskoi
Physical & Chemical Properties Division
NIST
Gaithersburg, MD
srheller@nist.gov
The slides from this talk can be found at:

http://www.hellers.com/steve/pub-talks/acs-8-05/frame.html
Disclaimer

The opinions presented on these slides are those of the slides and not necessarily that of the speaker or any organization.  No animals were harmed in the preparation of this talk; however a few WWW sites were hit.
Internet Sources of Open Access Information
Peter Suber - SPARC
http://www.arl.org/sparc/soa/index.html
Harnad list http://www.cogsci.soton.ac.uk/~harnad/Hypermail/Amsci/index.html



CINF session on Public Policy and International Science Issues
Key Factors:
1. Internet
2. Internet
3. Internet
4. Internet
The new Global World features
 Structural Changes in communications and interactions
Lack of Allegiances
Membership in professional societies is declining because people interact and communicate differently, using the Internet. The trend is most obvious among younger scientists.
While AAAS has lost perhaps 20% of its membership in the past few years, the ACS drop, while smaller, follows the same trend.
                1998        2004
Full Rate       106,463     103,227
Those who only remember the past are condemned to misread the future.

F. Zakaria, Newsweek, 8/15/05
The Internet has caused major changes for scientists around the world. The Open Access, Open Source, and Open Data projects are a result of this new world. These projects are now growing, changing both the way scientists do business and the businesses that scientists use.
International Cooperation – An Example
Open Data - NIH/Roadmap PubChem Project
Accepting  PubChem policies and plans and working with NIH to improve public health and fight deadly diseases:
ASINEX, BIND, BioCyc, CMLD-BU, ChemBridge, ChemExper Chemical Dictionary, ChemIDplus, DPISMR, DTP/NCI, Elsevier, KEGG, MICAD, MMDB, MOLI, NCGC, NIAID, NIST, NIST Chemistry WebBook, NMRShiftDB, Nature Chemical Biology
Not accepting  PubChem policies or plans and not yet working with NIH:
ACS
"Timing is Everything"

In 1980, if a publisher went out of business, disseminating scientific articles to the community would have been very difficult.

In 2005 if a publisher went out of business, one would just expect Google to take over and deliver the manuscripts.
Scribes in the 15th century were not happy with Johann Gutenberg.

Publishers in the 21st century are not happy with Tim Berners-Lee.

Organizations that fail to recognize and confront technological and market changes often lose their positions, if not the organizations themselves. History is replete with such examples. In the 18th century the power loom replaced the handloom weavers. In the early 20th century the horse and buggy industry gave way to automobiles. In the late 20th century the airplane replaced the train and boat for long-distance travel. Now, at the start of the 21st century, the technology of the Internet is threatening the way in which the 3+ century old scientific publishing industry, and the libraries which subscribe to scholarly publications, have done business for many decades.
There is a Chinese expression for the blindness brought on by inside perspective: jing di zhi wa, "frog in the bottom of a well." The frog looks up and sees only a single circle of the sky; he thinks he sees clearly, but "he doesn't know how big heaven really is."
Rachel DeWoskin
Foreign Babes in Beijing, 2005

Summary:
A structural change in the way the new generation gets information
Why change is slow, but inevitable
Momentum – People are conservative and change slowly. Even an airplane at 30,000 feet does not fall straight into the sea…
A Canadian Airbus pilot was forced/able to fly his plane, with 304 people aboard,  for 113 miles without power or fuel before making a forced landing in the Azores.

August 2001
While a Porsche costs less than a Ferrari, it is still not cheap. (That is, an ACS journal may cost less than one from Elsevier, but if you are a small college, you can afford neither.)

Open Access, Open Source, Open Data will, in the long run, replace much, but not all, of costly information that scientists use.
System Problems
1. Costs are high
2. No cost for manuscript submission. Under ANY economic model, the high volume of submissions made possible by the Internet will drown any system.
3. Lack of leadership at research institutions to demand changes in researchers' publication behavior.
4. Difficulty of instituting change
But there are some moving towards the future…
Nucleic Acids Research
Impact Factor -- 6.575

Overview of NAR’s Open Access model for 2005
From 1st January 2005, all articles published in NAR will be made freely available online immediately upon publication. This means that it will no longer be necessary to hold a subscription in order to read NAR online – content published in the journal will be easily accessible to everyone.

Our decision to implement an Open Access model for 2005 is based in part on a large-scale survey of NAR authors and reviewers. Between March and April 2004, over 1000 members of the journal’s community responded to our survey, with the majority supporting a move to full Open Access partially funded by author publication charges. We have also discussed possible models with representatives of the librarian community, who have expressed support for our experimentation with Open Access.

http://www3.oup.co.uk/nar/special/14/default.html
InChI

A project whose time has come. The Internet, an international scientific body (IUPAC), and international cooperation (US, UK) have led to the speedy development, implementation, and use of InChI.

While InChI is a public domain system for creating a unique computer-readable identifier (“name”), it is NOT a registry system. InChIs are created only by those who choose to adopt and use the algorithm. Registry systems which index the literature are complementary to any InChI databases that anyone creates.
Digital ‘Naming’ of Chemicals
Chemical structure is the true ‘identifier’
But, structure representations are not unique or convenient for computers.
So, convert structure to a unique ‘name’ by fixed algorithms
The IUPAC International Chemical Identifier (InChI)
Two Problems
Chemicals
Fast isomerization (tautomerization)
Ill-defined connectivity
Chemists
Differing conventions
Depends on discipline, education and convenience
Imprecision/uncertainty
3 Steps to InChI
Chemistry
‘Normalize’ Input Structure
Implement chemical rules
Math
‘Canonicalize’ (label the atoms)
Equivalent atoms get the same label
Format
‘Serialize’ Labeled Structure
Output as character string (‘name’)
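The three steps above can be sketched on a toy molecular graph. This is only an illustration under stated assumptions, not the InChI algorithm: the function names, the "drop bond orders" normalization rule, the brute-force numbering search, and the output format are all hypothetical choices made for this example. Hydrogens are left out, and the search is practical only for a handful of atoms.

```python
from itertools import permutations


def normalize(atoms, bonds):
    """Chemistry step (toy rule, an assumption): keep plain connectivity only."""
    # Each bond is (i, j, order); the order is discarded on purpose.
    edges = {frozenset((i, j)) for i, j, _order in bonds}
    return list(atoms), edges


def canonicalize(atoms, edges):
    """Math step: pick a numbering that does not depend on the input order.

    Brute force over every renumbering and keep the smallest description.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] is the new index of that atom.
        elements = [None] * n
        for old, new in enumerate(perm):
            elements[new] = atoms[old]
        relabeled = sorted(tuple(sorted(perm[a] for a in edge)) for edge in edges)
        candidate = (tuple(elements), tuple(relabeled))
        if best is None or candidate < best:
            best = candidate
    return best


def serialize(canonical):
    """Format step: write the labeled structure as one character string."""
    elements, edges = canonical
    atom_part = "".join(elements)
    bond_part = ",".join(f"{a + 1}-{b + 1}" for a, b in edges)
    return f"{atom_part}/{bond_part}"


def toy_identifier(atoms, bonds):
    return serialize(canonicalize(*normalize(atoms, bonds)))


# Ethanol heavy atoms drawn with two different atom numberings,
# and dimethyl ether (same formula, different connectivity).
ethanol_a = toy_identifier(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_b = toy_identifier(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
ether = toy_identifier(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])
print(ethanol_a, ethanol_b, ether)
```

The two ethanol drawings produce the same string and the ether produces a different one, which is the property a structure-derived 'name' needs.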
Normalize
Simplify
Divide structure into ‘layers’
Each layer ‘refines’ structure
Ignore ‘Electron Density’
Use simple ‘connectivity’ only
Ignore bond type and electron location
Stereochemistry
sp2 and sp3 only
Free rotation around single bonds
No Z/E stereo for small rings (default)
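The layered design can be seen directly in an identifier string, where layers are separated by "/" and most begin with a one-letter prefix. The sketch below splits a string into its layers. The function name and the descriptive labels are assumptions made for this example, and the ethanol string is used only as an illustration. It assumes the first segment after the version is the chemical formula, which is the common case but not guaranteed for every identifier.

```python
# One-letter layer prefixes; the descriptive labels are this sketch's wording.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogen atoms",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
    "f": "fixed hydrogens",
}


def split_layers(inchi):
    """Return the layers of an identifier as (label, content) pairs, in order."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *rest = inchi[len(prefix):].split("/")
    layers = [("version", version)]
    if rest:
        # Assumption: the formula layer comes first and has no prefix letter.
        layers.append(("formula", rest[0]))
    for segment in rest[1:]:
        label = LAYER_NAMES.get(segment[:1], "unknown")
        layers.append((label, segment[1:]))
    return layers


for label, content in split_layers("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"):
    print(f"{label}: {content}")
```

A list of pairs is used rather than a dictionary because some layer prefixes can occur more than once in a single identifier.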
InChI Capabilities
Identify compounds at the known level of detail
Convention-free (mostly)
Generate quickly from structure
Contains all essential connectivity information
Simple ASCII representation
InChI FAQs
How
Current InChI Project -1
Chemical Nomenclature and Structure Representation Division (VIII)
Number: 2004-039-1-800
Title: IUPAC International Chemical Identifier (InChI): promotion and extension
Task Group
Chairman: Alan McNaught
Members: Stephen R. Heller, Jaroslav Kahovec, Stephen Stein, Dmitrii Tchekhovskoi, and Andrey Yerin
Objective:
Following the launch of InChI version 1.0:
to promote its use throughout the chemical information community
to extend its applicability to include polymeric structures
to explore the need for other extensions, including the ability to handle Markush structures, and to include information on other attributes such as phases and excited states
Current InChI Project -2
Description:
Version 1.0 of the Identifier expresses chemical structures in a standard machine-readable format, in terms of atomic connectivity, tautomeric state, isotopes, stereochemistry, and electronic charge. It deals with neutral and ionic well-defined, covalently-bonded organic molecules, and also with inorganic, organometallic and coordination compounds.
We propose to promote actively the use of the algorithm and its associated implementations to developers of commercial chemical software, database compilers and publishers of chemical information, in order to enable sharing of molecular information throughout the worldwide community of chemical scientists.
We propose also to extend the applicability of the Identifier to polymeric structures, and to explore the need for and the practicality of an extension to cover Markush structures.
In addition, we will evaluate the need for inclusion of information on other attributes such as phases and excited states, and take steps to include such information if appropriate.
Current InChI Project -3
Progress:
Version 1 of IUPAC's International Chemical Identifier (InChI) was released in April 2005; software, documentation, source code and licensing conditions are available from the IUPAC website at www.iupac.org/inchi
An InChI FAQ presented by Nick Day (Unilever Centre for Molecular Informatics, Cambridge University) is available from http://wwmm.ch.cam.ac.uk/inchifaq/
May 2005 update
To enable development of InChI facilities and applications in an Open Source context, a project to encompass this work has been registered with SourceForge.net (see http://sourceforge.net/projects/inchi); people wishing to participate should contact the project administrator (mcnaughta@rsc.org) or the IUPAC Secretariat (secretariat@iupac.org). To receive and discuss proposals for InChI enhancements, an internet listserver has also been established; people wishing to participate in these discussions should contact Alan McNaught (mcnaughta@rsc.org).
InChI References/Publications
1. Sophie Rovner, C&E News, "CHEMICAL 'NAMING' METHOD UNVEILED", August 22, 2005,
Volume 83, Number 34, pp. 39-40
2. International chemical identifier goes online, Chem. World, 16 May 2005
3. M.D. Prasanna, J. Vondrasek, A. Wlodawer and T.N. Bhat, Application of InChI to Curate, Index, and Query 3-D Structures, Proteins: Structure, Function, and Bioinformatics, 2005, 60, 1-4
4. Enhancement of the chemical semantic web through the use of InChI identifiers, S.J. Coles, N.E. Day, P. Murray-Rust, H.S. Rzepa and Y. Zhang, Org. Biomol. Chem., 2005, 3(10), 1832-1834
5. InChI FAQ, by Nick Day (Unilever Centre for Molecular Informatics, Cambridge University)
6. Representation and Use of Chemistry in the Global Electronic Age, P. Murray-Rust, H.S. Rzepa, S.M. Tyrrell and Y. Zhang, Org. Biomol. Chem., 2004, 3192-3203 [www.ch.ic.ac.uk/rzepa/obc/]
7. That InChI feeling, Reactive Reports, issue 40, Sep 2004
8. Unique labels for compounds, Chem. & Eng. News, 2 Dec 2002
9. Chemists synthesize a single naming system, Nature, 23 May 2002
10. That IChI feeling ... The Alchemist, 24 Apr 2002
11. What's in a Name? The Alchemist, 21 Mar 2002
12. Stephen E. Stein, Stephen R. Heller, and Dmitrii Tchekhovskoi, An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier, Proceedings of the 2003 International Chemical Information Conference (Nimes), Infonortics, pp. 131-143.
Early InChI Adopters
NIST – 150,000
NIH/NCBI/PubChem project – 4,000,000+
ISI – 2+ million structures
NCI Database – 23 million+
EPA DSSTox database – 1,450
KEGG database – 9,584
UCSF ZINC – 3.3 million
Future
Future versions of InChI, for example, could include phase information and crystal structure, conformations, electronic states and additional classes of stereochemistry.
First additional project: Investigate adding polymers to InChI
Acknowledgements
Steve Bachrach, Steve Bryant, Denise Creech, Rene Deplanque, Guenter Grethe, Stevan Harnad, Sami Kassab, Gary Mallard, Randy Marcinko, Alan McNaught, Bill Milne, Carmen Nitsche, Chris Reed, Rich Roberts, Peter Murray-Rust, Henry Rzepa, Steve Stein, Peter Shepherd, Bill Town, Andrea Twiss-Brooks, and Ann Wolpert