Open Source/Open Access and the IUPAC International Chemical
Identifier – InChI
Stephen Heller, Stephen Stein, & Dmitrii Tchekhovskoi
Physical & Chemical Properties Division
NIST
Gaithersburg, MD
srheller@nist.gov
The slides from this talk can be found at:
http://www.hellers.com/steve/pub-talks/noord605/frame.html
Disclaimer
The opinions presented on these slides are those of the slides and not necessarily those of the speaker.
No animals were harmed in the preparation of this talk; however, a few WWW sites were hit. This talk conforms to PETA & NIH treatment of human subjects guidelines. No stem cells were used to prepare this talk.
These slides were made from 100% recycled electrons.
This will be a well-balanced presentation. I have a chip on both shoulders.
Outline of Today’s talk
Introduction/Background
Open Access
Open Source Chemical Structures – InChI
Summary
Internet Sources of Open Access Information
Peter Suber - SPARC
http://www.arl.org/sparc/soa/index.html
Harnad list http://www.cogsci.soton.ac.uk/~harnad/Hypermail/Amsci/index.html
Requirement:
To communicate and provide a permanent record of scientific research.

Solution:
Has varied over time.
Scribes in the 15th century were not happy with Johann Gutenberg.

Publishers in the 21st century are not happy with Tim Berners-Lee.
Noah rule:
 
Predicting rain doesn't count; building arks does.

Warren Buffett
[Chart legend: dark blue line = talks; pink line = manuscripts]
Organizations that fail to recognize and confront technological and market changes tend to lose their positions, if not their existence. History is replete with examples. In the 18th century, power looms replaced the handloom weavers. In the early 20th century, the horse-and-buggy industry gave way to automobiles. In the late 20th century, the airplane replaced the train and the boat for long-distance travel. Now, at the start of the 21st century, the Internet is threatening the way in which the more than three-century-old scientific publishing industry, and the libraries that subscribe to scholarly publications, have done business for many decades.

X-rays will prove to be a hoax
Lord Kelvin, 1883

Radio has no future
Lord Kelvin, 1897

Fooling around with alternating current is a waste of time.
Nobody will use it, ever.
Thomas Edison, circa 1900

Stocks have reached what looks like a permanently high plateau.
Prof. Irving Fisher (Yale), 10/17/1929

I think there is a world market for maybe five computers.
Thomas Watson, Chairman of IBM, 1943



We don’t like their sound. Groups of guitars are on their way out.
Decca Records executive, 1962

There is no reason for any individual to have a computer
in their home.
Ken Olsen
President Digital Equipment Corp.
1977

640K ought to be enough for anybody.
Bill Gates, 1981

Open Access is evil
Commercial & Society Publishers, 2005
I want to assure you that I have never felt better about the prospects for the company.

Enron CEO Ken Lay, after selling $160 million of his own company stock
(August 14, 2001)




Companies come and go. It's part of the genius of capitalism.

Paul O'Neill, Treasury secretary, on the collapse of Enron (January 14,  2002)
...publishers have grown fat by charging libraries hundreds or thousands of dollars a year for subscriptions to printed artifacts that might not contain information of real importance.

Harry Collier, Digital Publishing Strategies, 11/97, page 16
System Problems
Costs are high
No cost for manuscript submission. Under ANY economic model, the high volume of manuscripts submitted via the Internet will drown any system.
Lack of leadership at research institutions in demanding changes to researchers' publication behavior.
Difficulty of instituting change
System Requirements
The scholarly community needs organizations to accept, review, disseminate, and archive manuscripts
Only institutions have infinite lifetimes; humans don’t (i.e., self-archiving is nice, but too finite for civilization to benefit)
OA Players
Researchers
Publishers
Libraries
Stevan Harnad
Stevan Harnad
Three Examples of OA Journals and OA Policies
Nucleic Acids Research
Impact Factor -- 6.575
Overview of NAR’s Open Access model for 2005
From 1st January 2005, all articles published in NAR will be made freely available online immediately upon publication. This means that it will no longer be necessary to hold a subscription in order to read NAR online – content published in the journal will be easily accessible to everyone.
Our decision to implement an Open Access model for 2005 is based in part on a large-scale survey of NAR authors and reviewers. Between March and April 2004, over 1000 members of the journal’s community responded to our survey, with the majority supporting a move to full Open Access partially funded by author publication charges. We have also discussed possible models with representatives of the librarian community, who have expressed support for our experimentation with Open Access.
http://www3.oup.co.uk/nar/special/14/default.html
New ACS Policy –

Open Access will come as soon as:
[Slides 28-31: image-only slides, not reproduced]
Discussions are underway to use InChI as the chemical identifier in the Beilstein Journal of Organic Chemistry (BJOC)
What about funding?
Economics
Scholarly journals: the only item in the USA whose cost is rising faster than health care.
Ann Wolpert, MIT
For those who are not sure Darwin was correct: there must be a God – no human could ever have created such a dysfunctional system.
The Future
Between researchers putting their results on the web and Google/Yahoo/Microsoft developing ways to search text and chemical structures, all non-copyright, non-proprietary information will be readily available. Who knows, Google might even buy all of Elsevier’s back-file content one day.
The Future
In The Hitchhiker’s Guide to the Galaxy, the Earth was vaporized to make way for a new intergalactic highway. Destroying Earth was, to quote Defense Secretary Rumsfeld, “collateral damage”. Perhaps we will be saying that of publishers one day soon as well.
Nowhere in the US Constitution or the proposed EC Constitution is there a guaranteed right for publishers to have 40% profit margins, or even to remain in business.

History of abstracting services*:

Chemisches Zentralblatt: 1856 – 1969
British Abstracts: 1849 – 1953
Chemical Abstracts: 1907 – ??
Google: 2004 – ???

* The Evolution of the Secondary Literature in Chemistry - Helen Schofield
What is the state bird of Hawaii?
Nene
Open Access for Chemical Structures
The IUPAC Chemical Identifier Project – InChI – An Open Access/Open Source project of IUPAC
Standard Structure Representation is Not New
The standard molecular data format (SMD format) as an integration tool in computer chemistry
H. Bebak, C. Buse, W. T. Donner, P. Hoever, H. Jacob, H. Klaus, J. Pesch, J. Roemelt, P. Schilling, et al.;
J. Chem. Inf. Comput. Sci.; 1989; 29(1); 1-5.
Digital ‘Naming’ of Chemicals
Chemical structure is the true ‘identifier’
But, structure representations are not unique or convenient for computers.
So, convert structure to a unique ‘name’ by fixed algorithms
The IUPAC International Chemical Identifier (InChI)
Two Problems
Chemicals
Fast isomerization (tautomerization)
Ill-defined connectivity
Chemists
Differing conventions
Depends on discipline, education and convenience
Imprecision/uncertainty
3 Steps to InChI
Chemistry
‘Normalize’ Input Structure
Implement chemical rules
Math
‘Canonicalize’ (label the atoms)
Equivalent atoms get the same label
Format
‘Serialize’ Labeled Structure
Output as character string (‘name’)
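The three steps above can be sketched on a toy scale. The following is only an illustration of the normalize / canonicalize / serialize idea, not the InChI algorithm: the function name `toy_identifier`, the input format, and the output syntax are all assumptions made for this example, and the brute-force relabelling is feasible only for a handful of atoms. Unlike InChI, it does not assign the same label to equivalent atoms.

```python
from itertools import permutations


def toy_identifier(atoms, bonds):
    """Return a numbering-independent string for a tiny molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: iterable of (i, j) or (i, j, order) tuples of atom indices
    (Hypothetical input format, chosen for this sketch only.)
    """
    # Step 1, "normalize": keep simple connectivity, ignore bond orders.
    edges = {frozenset(bond[:2]) for bond in bonds}

    # Step 2, "canonicalize": try every relabelling of the atoms and keep
    # the one whose (atom list, bond list) sorts first.
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):  # perm[old_index] = new_label
        relabelled = [None] * n
        for old, new in enumerate(perm):
            relabelled[new] = atoms[old]
        new_edges = sorted(tuple(sorted(perm[i] for i in e)) for e in edges)
        candidate = (tuple(relabelled), tuple(new_edges))
        if best is None or candidate < best:
            best = candidate

    # Step 3, "serialize": write the labelled structure as one ASCII string.
    atom_part = "".join(best[0])
    bond_part = ",".join(f"{i + 1}-{j + 1}" for i, j in best[1])
    return f"{atom_part}/{bond_part}"


# Ethanol (heavy atoms only) drawn with two different atom numberings.
ethanol_a = toy_identifier(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = toy_identifier(["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether has the same atoms but different connectivity.
ether = toy_identifier(["C", "O", "C"], [(0, 1), (1, 2)])
print(ethanol_a, ethanol_b, ether)
```

Both ethanol drawings produce the same string, while dimethyl ether produces a different one, which is the property the "unique name by fixed algorithm" idea relies on.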
Normalize
Simplify
Divide structure into ‘layers’
Each layer ‘refines’ structure
Ignore ‘Electron Density’
Use simple ‘connectivity’ only
Ignore bond type and electron location
Stereochemistry
sp2 and sp3 only
Free rotation around single bonds
No Z/E stereo for small rings (default)
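The layering described above is visible in the identifier string itself, where layers are separated by "/" and most begin with a one-letter prefix. The sketch below splits a string into its layers. The helper name `split_layers` is hypothetical, the prefix table is an assumed subset of the layer prefixes used in version 1, and the code assumes a formula layer is present; the example string is the version 1 identifier for ethanol.

```python
# Assumed subset of one-letter layer prefixes used by InChI version 1.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogen atoms",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
    "f": "fixed hydrogens",
}


def split_layers(inchi):
    """Split an InChI string into (layer name, value) pairs, in order."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    # The first two fields carry no prefix letter: version, then formula.
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = [("version", version), ("formula", formula)]
    for segment in rest:
        name = LAYER_NAMES.get(segment[:1], "unknown")
        layers.append((name, segment[1:]))
    return layers


ETHANOL = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"
for name, value in split_layers(ETHANOL):
    print(f"{name:>15}: {value}")
```

Each later layer refines the structure described by the earlier ones, so a reader (or a program) can stop at whatever layer matches the level of detail needed.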
InChI Capabilities
Identify compounds at the known level of detail
Convention-free (mostly)
Generate quickly from structure
Contains all essential connectivity information
Simple ASCII representation
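One way to read "identify compounds at the known level of detail" is that two records can be compared with or without their stereo layers. The sketch below is an assumption-laden illustration, not part of the InChI software: `without_stereo` is a hypothetical helper, and the two strings are believed to be the version 1 identifiers for (E)- and (Z)-but-2-ene but should be treated as illustrative.

```python
# Prefixes of the stereo-related layers (b, t, m, s); assumed for this sketch.
STEREO_PREFIXES = ("b", "t", "m", "s")


def without_stereo(inchi):
    """Drop stereo layers, keeping formula, connectivity and hydrogens."""
    head, *layers = inchi.split("/")
    kept = [seg for seg in layers if not seg.startswith(STEREO_PREFIXES)]
    return "/".join([head, *kept])


E_BUTENE = "InChI=1/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"
Z_BUTENE = "InChI=1/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3-"

# With stereo the two isomers differ; without it they are the same compound.
print(E_BUTENE == Z_BUTENE)
print(without_stereo(E_BUTENE) == without_stereo(Z_BUTENE))
```

Because the identifier is a plain ASCII string, this kind of comparison needs nothing beyond ordinary string handling.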
InChI FAQs
Available from Nick Day, Cambridge University, UK:
ned24@cam.ac.uk
https://sourceforge.net/projects/inchi

The IUPAC International Chemical Identifier (InChI) is a protocol for converting a chemical structure (connection table) to a unique, predictable ASCII character string. This project will develop facilities for using and applying the InChI algorithm.

Intended Audience: Science/Research
License: Artistic License
Topic: Chemistry


Project UNIX name: inchi
Registered: 2005-04-16 03:40
Activity Percentile (last week): 63.73
Current InChI Project -1
Chemical Nomenclature and Structure Representation Division (VIII)
Number: 2004-039-1-800
Title: IUPAC International Chemical Identifier (InChI): promotion and extension
Task Group
Chairman: Alan McNaught
Members: Stephen R. Heller, Jaroslav Kahovec, Stephen Stein, Dmitrii Tchekhovskoi, and Andrey Yerin
Objective:
Following the launch of InChI version 1.0:
to promote its use throughout the chemical information community
to extend its applicability to include polymeric structures
to explore the need for other extensions, including the ability to handle Markush structures, and to include information on other attributes such as phases and excited states
Current InChI Project -2
Description:
Version 1.0 of the Identifier expresses chemical structures in a standard machine-readable format, in terms of atomic connectivity, tautomeric state, isotopes, stereochemistry, and electronic charge. It deals with neutral and ionic well-defined, covalently-bonded organic molecules, and also with inorganic, organometallic and coordination compounds.
We propose to promote actively the use of the algorithm and its associated implementations to developers of commercial chemical software, database compilers and publishers of chemical information, in order to enable sharing of molecular information throughout the worldwide community of chemical scientists.
We propose also to extend the applicability of the Identifier to polymeric structures, and to explore the need for and the practicality of an extension to cover Markush structures.
In addition, we will evaluate the need for inclusion of information on other attributes such as phases and excited states, and take steps to include such information if appropriate.
Current InChI Project -3
Progress:
Version 1 of IUPAC's International Chemical Identifier (InChI) was released in April 2005; software, documentation, source code and licensing conditions are available from the IUPAC website at www.iupac.org/inchi
An InChI FAQ presented by Nick Day (Unilever Centre for Molecular Informatics, Cambridge University) is available from http://wwmm.ch.cam.ac.uk/inchifaq/
May 2005 update
To enable development of InChI facilities and applications in an Open Source context, a project to encompass this work has been registered with SourceForge.net (see http://sourceforge.net/projects/inchi); people wishing to participate should contact the project administrator (mcnaughta@rsc.org) or the IUPAC Secretariat (secretariat@iupac.org). To receive and discuss proposals for InChI enhancements, an internet listserver has also been established; people wishing to participate in these discussions should contact Alan McNaught (mcnaughta@rsc.org).
InChI References/Publications
1. International chemical identifier goes online, Chem. World, 16 May 2005
2. M.D. Prasanna, J. Vondrasek, A. Wlodawer and T.N. Bhat, Application of InChI to Curate, Index, and Query 3-D Structures, Proteins: Structure, Function, and Bioinformatics, 2005, 60, 1-4
3. S.J. Coles, N.E. Day, P. Murray-Rust, H.S. Rzepa and Y. Zhang, Enhancement of the chemical semantic web through the use of InChI identifiers, Org. Biomol. Chem., 2005, 3(10), 1832-1834
4. InChI FAQ, by Nick Day (Unilever Centre for Molecular Informatics, Cambridge University)
5. P. Murray-Rust, H.S. Rzepa, S.M. Tyrrell and Y. Zhang, Representation and Use of Chemistry in the Global Electronic Age, Org. Biomol. Chem., 2004, 3192-3203 [www.ch.ic.ac.uk/rzepa/obc/]
6. That InChI feeling, Reactive Reports, issue 40, Sep 2004
7. Unique labels for compounds, Chem. & Eng. News, 2 Dec 2002
8. Chemists synthesize a single naming system, Nature, 23 May 2002
9. That IChI feeling ..., The Alchemist, 24 Apr 2002
10. What's in a Name?, The Alchemist, 21 Mar 2002
11. Stephen E. Stein, Stephen R. Heller, and Dmitrii Tchekhovskoi, An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier, in Proceedings of the 2003 International Chemical Information Conference (Nimes), Infonortics, pp. 131-143.
Early InChI Adopters
NIST – 150,000 structures
NIH/NCBI/PubChem project – 800,000+ structures
ISI – 2+ million structures
NCI Database – 23 million+ structures
EPA – DSSTox database – 1450 structures
KEGG database – 9584 structures
UCSF ZINC – 3.3 million structures
Future
Future versions of InChI, for example, could include phase information and crystal structure, conformations, electronic states and additional classes of stereochemistry.
First additional project: Investigate adding polymers to InChI
Don’t give up –

Moses was once a basket case
I would never die for my beliefs because I might be wrong.

    Bertrand Russell
Acknowledgements
I really think my friends would prefer if I left their names off this slide.
Acknowledgements
Steve Bachrach, Mila Becker, Pieter Bolman, Bob Bovenschulte, Steve Bryant, Alice Cooper, Rene Deplanque, Guenter Grethe, Stevan Harnad, Sami Kassab, Gary Mallard, Randy Marcinko, Alan McNaught, Bill Milne, Carmen Nitsche, Chris Reed, Rich Roberts, Peter Murray-Rust, Henry Rzepa, Steve Stein,
Peter Shepherd, Bill Town, Andrea Twiss-Brooks, Ann Wolpert