Open Access/Open Source/Open Data and the IUPAC International Chemical Identifier (InChI)
Stephen Heller*, Stephen Stein, & Dmitrii Tchekhovskoi
Physical & Chemical Properties Division
NIST
Gaithersburg, MD 20899
*affiliation for InChI project work – This is not a NIST presentation
steve@hellers.com
The slides from this talk can be found at:

http://www.hellers.com/steve/pub-talks/goslar-1105/frame.html
Disclaimer
The opinions presented on these slides are those of the slides and not necessarily those of the speaker.
No animals were harmed in the preparation of this talk; however a few WWW sites were hit. This talk conforms to PETA & NIH treatment of human subjects guidelines.
These slides were made from 100% recycled electrons.
There are no George W. Bush jokes in this presentation.
This will be a well balanced presentation. I have a chip on both shoulders.



Critical Factors Affecting Chemical Information
Key Factors:
1. Internet
2. Internet
3. Internet
4. Internet
5. Internet
The Internet has caused major changes for scientists around the world. The Open Access, Open Source, and Open Data projects are a result of this new world.  These projects are now growing and changing the way scientists do business and changing the businesses that scientists use.
Outline
Overview of the new world
Open Access
Open Data
Open Source
Overview
The Internet has created ONE marketplace. There are no longer mini-marketplaces. The way to make things work is to link these mini-markets.
The new Global World features
 Structural Changes in communications and interactions
Lack of Allegiances
Lack of Allegiances
Membership in professional societies is declining because people interact and communicate differently – using the Internet. The trend is most obvious among younger scientists.
While AAAS has lost perhaps 20% of its membership in the past few years, the ACS drop, while smaller, shows the same trend.
                     1998         2004
Full Paid Rate       106,463      103,227
Those who only remember the past are condemned to misread the future.

F. Zakaria, Newsweek, 8/15/05
The new world requires new business models.
Information is not free. Someone needs to pay. Databases are never “finished”. Someone needs to update, add to, and/or correct databases. Software is never “finished”. Someone needs to correct errors, add new features, and maintain the software as new hardware and computer networks become available.
The Internet has changed the way databases are created, as now virtually everything comes in computer-readable form. This lowers the cost of obtaining raw data/information. But the data still requires manual, labor-intensive effort to be curated or checked.
"Timing is Everything"

In 1980 if a publisher went out of business, disseminating scientific articles to the community would be very difficult.

In 2005 if a publisher went out of business, one would just expect Google to take over and deliver the manuscripts.
There is a Chinese expression for the blindness brought on by inside perspective: jing di zhi wa, "frog in the bottom of a well." The frog looks up and sees only a single circle of the sky; he thinks he sees clearly, but "he doesn't know how big heaven really is."

Rachel DeWoskin
Foreign Babes in Beijing, 2005

Summary:
A structural change in the way the new generation gets information
Scribes in the 15th century were not happy with Johann Gutenberg.

Publishers in the 21st century are not happy with Tim Berners-Lee

Organizations that fail to recognize and confront technological and market changes often lose their positions, if not their organizations. History is replete with such examples. In the 18th century power looms replaced the handloom weavers. In the early 20th century the horse-and-buggy industry gave way to automobiles. In the late 20th century the airplane replaced the train and boat for long-distance travel. Now, at the start of the 21st century, the technology of the Internet is threatening the way in which the 3+ century old scientific publishing industry, and the libraries which subscribe to scholarly publications, have done business for many decades.

X-rays will prove to be a hoax
Lord Kelvin, 1883

Radio has no future
Lord Kelvin, 1897

Fooling around with alternating current is a waste of time.
Nobody will use it, ever.
Thomas Edison, circa 1900

I think there is a world market for maybe five computers.
Thomas Watson, Chairman of IBM, 1943



There is no reason for any individual to have a computer in their home.
Ken Olsen
President, Digital Equipment Corp.
1977

640K ought to be enough for anybody.
Bill Gates, 1981

 Open Access is evil
Commercial & Society Publishers, 2005
The Future vs. The Past
Hindsight is Easy
Creative vs. Custodial
Vision vs. Blindness
Constructive vs. Destructive
Why change is slow, but inevitable
Momentum – People are conservative and change slowly.  Even an airplane at 30,000 feet does not fall straight into the sea…
A Canadian Airbus pilot was forced/able to fly his plane, with 304 people aboard,  for 113 miles without power or fuel before making a forced landing in the Azores.

August 2001
While a Porsche costs less than a Ferrari, it is still not cheap. (That is, an ACS journal may cost less than one from Elsevier, but if you are a small college, you can afford neither.)

Open Access, Open Source, and Open Data will, in the long run, replace much, but not all, of the costly information that scientists use.
Open Data
Another Disclaimer:


Any resemblance to real persons, living or dead, on any of these slides is purely coincidental.
International Cooperation – An Example
Open Data - NIH/Roadmap PubChem Project
Accepting PubChem policies and plans and working with NIH to improve public health and fight deadly diseases:
ASINEX, BIND, BioCyc, CMLD-BU, ChemBank, ChemBridge, ChemExper Chemical Dictionary, ChemIDplus, DPISMR, DTP/NCI, Elsevier, KEGG, MICAD, MMDB, MOLI, NCGC, NIAID, NIST, NIST Chemistry WebBook, NMRShiftDB, Nature Chemical Biology
Not accepting PubChem policies or plans and not yet working with NIH:
ACS
CAS Management?
Marie – “Let them eat cake” Antoinette – ACS Management??
Open Access
System Problems
1. Costs are high due to legacy expenses.
2. No cost for manuscript submission. Under ANY economic model the high volume of submissions made possible by the Internet will drown any system.
3. Lack of leadership at research institutions to demand changes in researchers’ publication behavior.
4. Difficulty of instituting change
Internet Sources of Open Access Information
Peter Suber - SPARC
http://www.arl.org/sparc/soa/index.html
Harnad list http://www.cogsci.soton.ac.uk/~harnad/Hypermail/Amsci/index.html
Requirement:
To communicate and provide a permanent record of scientific research.

Solution:

Has varied over time.
In the 1990s, Sebastian Junger wrote a book called The Perfect Storm, which described a weather event that had never occurred before in recorded history. A combination of rain, wind, cold air from Canada and warm air from the Atlantic Ocean created this event, stemming from two very dissimilar air masses. In his book The Innovator's Dilemma, Clayton Christensen describes what happens when disruptive changes in technology create the environment for a breakthrough or change that could never happen on its own. Open Access is the result of the Internet and the financial environment in scholarly libraries (the so-called “serials crisis”), which together have created another "perfect storm".
Open Access

An Open Access Publication is one that meets the following two conditions:
The author(s) and copyright holder(s) grant(s) to all users a free, irrevocable, worldwide, perpetual right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship, as well as the right to make small numbers of printed copies for their personal use.

A complete version of the work and all supplemental materials, including a copy of the permission as stated above, in a suitable standard electronic format is deposited immediately upon initial publication in at least one online repository that is supported by an academic institution, scholarly society, government agency, or other well-established organization that seeks to enable open access, unrestricted distribution, interoperability, and long-term archiving (for the biomedical sciences, PubMed Central is such a repository).

From the PLoS web site
Basic Goal or Reason for Publishing:

Communicate to others what you have found



Real (Secondary) Goal:

Improve your position in life (money, recognition, promotion, etc.)
Most scientific manuscripts are write-only,

so while you get what you need, you still pay for everything
…Publishers have grown fat by charging libraries hundreds or thousands of dollars a year for subscriptions to printed artifacts that might not contain information of real importance.

Harry Collier, Digital Publishing Strategies, 11/97, page 16
System Requirements
The scholarly community needs organizations to accept, review,  disseminate, and archive manuscripts
Only institutions have infinite lifetimes, humans don’t (i.e., self archiving is nice, but too finite for civilization to benefit)
There must be a God – no human could have ever created such a dysfunctional system.
Profit = Revenue – Expenses

                      1980          2005
Data Entry            100           10
Data Curation         20            50
Software/Hardware     100           50
Profit                22 (10%)      22 (20%)

Cost to user          242           132

Net Result: A happier customer (and a more profitable operation)
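The arithmetic behind the table can be checked with a short sketch. The figures are copied from the slide; reading the percentages as profit divided by expenses is an assumption that happens to reproduce the 10% and 20% shown, and the variable names are illustrative only.

```python
# Figures copied from the slide (arbitrary cost units).
costs = {
    "1980": {"data_entry": 100, "data_curation": 20, "software_hardware": 100},
    "2005": {"data_entry": 10, "data_curation": 50, "software_hardware": 50},
}
profit = {"1980": 22, "2005": 22}


def summarize(year):
    """Return (expenses, cost to user, profit as a fraction of expenses)."""
    expenses = sum(costs[year].values())
    # Profit = Revenue - Expenses, so Revenue (the cost to the user) is:
    revenue = expenses + profit[year]
    # Assumption: the slide's percentage is profit relative to expenses.
    margin_on_expenses = profit[year] / expenses
    return expenses, revenue, margin_on_expenses


for year in ("1980", "2005"):
    expenses, revenue, margin = summarize(year)
    print(f"{year}: expenses={expenses}, cost to user={revenue}, profit={margin:.0%}")
```

The same absolute profit is a larger share of a smaller cost base, which is how the user can pay less while the operation becomes more profitable.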
OA Players
Researchers
Publishers
Libraries
Stevan Harnad
OA Issues
Peer Review
Archiving
Economics
Peer Review
The worst system, except for all the others.
With apologies to Winston Churchill
Peer Review:


Quantity has a quality all of its own

Joseph Stalin

or
Peer review is nice, but
Reproducibility is what counts in science.
Peer review has nothing to do with OA.
Peer Review
Peer Review is about to collapse under the weight of too many short papers (LPUs, least publishable units), too many poor science papers, and too many poorly written manuscripts – all of which are too easily submitted via the Internet.
Peer Review Problems – Poor Manuscripts
For example:
“Two-membered unsaturated rings”
Part 1- Synthesis
Part 2 – Nitrogen derivatives
Part 3- Sulfur derivatives
Authors:
G. Marx, H. Marx, & Z. Marx, Freedonia Academy of Sciences
The solution -- independent of subscription/OA model -- is to charge for submission of manuscripts, and charge a second fee if accepted under the OA model
Archiving Examples
Vatican Library – 4th century
Bibliotheque Nationale de France – 1367
National Library of Sweden – 1568
Harvard University – 1638
German State Library in Berlin – 1661
National Library of Spain – 1711
British Library – 1753
US Library of Congress – 1800
ACS Electronic Journals – 1996
Economics
What about funding?
Open Access and Publishers –

 Why do they think evolution does not apply to them?
Nowhere in the  US Constitution is there a  guaranteed right for publishers to have 40% profit margins or even remain in business.

History of abstracting services*:

Chemisches Zentralblatt: 1856 – 1969
British Abstracts: 1849 – 1953
Chemical Abstracts: 1907 – ??
Google: 2004 – ???

* The Evolution of the Secondary Literature in Chemistry - Helen Schofield
Open Access will:
Provide global, universal free access to information
Resolve the serial & budget crises at libraries
Accelerate scientific progress and research
Enhance research productivity
Improve Quality Assurance
Grow hair on bald spots - ?
Economics
The publishers’ financial models have changed due to the Internet. They have replaced purchases, copyright, and fair use with leases and contracts.
Economics
Cost of publishing an OA article is US $100 - $15,000
(All financial numbers have been audited and approved by Arthur Andersen, Inc.)
Economics
Publishing Costs – Subscription model:
Editorial Staff
Sales
Marketing
Legal – Contracts, Copyright
IT/Computer Systems
Librarians’ Activities – Past Decade
Whine about increasing prices
Reduce journal and book purchases
Attempt to educate researchers about pricing issues
Provide journals in electronic form to researchers’ desktops
Provide electronic document delivery
Publishers’ Activities – Past Decade
Keep the cash flowing in:
Raise prices
Replace copyright with contracts/licenses
Object to any changes (e.g., Open Access)
Predict various doomsday scenarios in response to any change from the outside
Provide content in electronic form
Provide archives in electronic form
Researchers’ Activities – Past Decade
Business as usual – publish wherever they want
No change in where or how they publish (e.g., no use of the features of electronic media to enhance manuscripts)
Economics
What are the library infrastructure costs?
Purchasing, licensing agreements (staff size including lawyers), inventory, budgeting for journal/book reductions, document delivery, interlibrary loans, etc.
What costs disappear with OA?
The basic question that really needs to be answered:

How much money do you REALLY need to run the scholarly publishing system – and how do you raise this money?

Where can unnecessary costs be squeezed out of the system by all parties?
Researchers need to stop publishing partial results on an hourly/daily/weekly basis.

Journals need to charge authors for frivolous submissions. Peer review is not free now and never will be.
PLoS was a great cause and it attracted nearly 34,000 signatures from scientists in 180 countries. But, while a small handful of publishers complied, most blithely ignored the PLoS letter. Worse, most of the scientist signatories were happy to forswear their own petition and continued publishing in the very journals that had turned a deaf ear to their request.

Richard Poynder
http://www.infotoday.com/it/oct04/poynder.shtml


Of the 34,000 signatories, about 34 have actually published in an OA journal. You can lead a horse to water, but you can't make him/her drink.
[Chart legend: dark blue line = talks; pink line = manuscripts]
The real evil ones:

The self-centered, egotistical, and pampered researchers.
Solution - Prediction
Provosts at universities and colleges will mandate that researchers post their publications either on their institution's web site or on one or more public sites – libraries, universities, etc.
Don’t give up –

Moses was once a basket case
The Future (1)
Between researchers putting their results on the web and Google/Yahoo/Microsoft developing ways to search text and chemical structures, all non-copyright, non-proprietary information will be readily available. Who knows, Google might even buy all of Elsevier’s back-file content one day.
The Future (2)
In The Hitchhiker's Guide to the Galaxy, the Earth was vaporized to make way for a new intergalactic highway which was needed. Destroying Earth was, to quote Defense Secretary Rumsfeld, “collateral damage”. Perhaps we will be saying that of publishers one day soon as well.
Summary
There is a problem with too many manuscripts: how to publish them (peer review) and how to make them available. OA may be smoke and mirrors, but where there is smoke there usually is fire. And someone will put out the fire. I am betting on Google or some version of it to do so.
But there are some moving towards the future…
Nucleic Acids Research
Impact Factor – 7.260 (2004)

1. Institutional membership of NAR means that corresponding authors based at the member institution will qualify for substantially discounted NAR Open Access publication charges ($950 compared with $1900 per article in 2006).

2. Online access free of charge in 2006. Please note that a 2006 print subscription does not include institutional membership.

3. Online access will be completely free of charge in 2005. A print subscription or institutional membership provides discounted publication charges for corresponding authors based at the member institution. See www.nar.oupjournals.org/openaccess
and
http://www3.oup.co.uk/nar/special/14/default.html
 PNAS impact factor is 10.5 for 2004

PNAS Open Access Option
Beginning in 2005, each PNAS Institutional Site License (online subscription) will automatically include an Institutional Open Access Membership. Authors from institutions with 2005 Site Licenses/Open Access Memberships are entitled to a 25% discount off the PNAS Open Access Fee (regularly $1000) to make their papers immediately free online. PNAS offers this plan without increasing site license rates over 2004.

Authors
Authors of accepted manuscripts who are interested in publishing their article as Open Access should confirm their subscription status with their institutional librarian. If their institution has a 2005 site license, the author should note the reduced fee ($750) on the PNAS billing forms included with the author proofs.

Librarians
A number of librarians have communicated to the PNAS Office that they support this new Open Access initiative, and we hope that it will provide an incentive for institutions to adopt site licenses, to which we have given added value. Please help us inform authors at your institution about the PNAS Open Access Option and the 2005 Site License/Open Access Membership discount. PNAS is a break-even operation and relies about equally on author fees and on subscription fees to cover its operating costs.
Open Source
InChI

A project whose time has come. The Internet, an international scientific body (IUPAC), and international cooperation (US, UK) have led to the speedy development, implementation, and use of InChI.

While InChI is a public domain system for creating a unique computer-readable identifier (“name”), it is NOT a registry system. InChIs are created only by those who choose to adopt and use the algorithm. Registry systems which index the literature are complementary to any InChI databases that anyone creates.
Unique identifier for chemical structures

If we are to be able to find chemistry on the web then we need a primary key/unique identifier to locate the data.

InChI = a revolutionary new approach to indexing full chemical structure:

1. Not dependent on specialised search software
2. Compatible with XML, HTML, database fields etc
3. Robust when deployed on Web
4. Open and International (IUPAC)

(Slide from Nick Day & Peter Murray-Rust/Cambridge)
Digital ‘Naming’ of Chemicals
Chemical structure is the true ‘identifier’
But, structure representations are not unique or convenient for computers.
So, convert structure to a unique ‘name’ by fixed algorithms
The IUPAC International Chemical Identifier (InChI)
Two Problems
Chemicals
Fast isomerization (tautomerization)
Ill-defined connectivity
Chemists
Differing conventions
Depends on discipline, education and convenience
Imprecision/uncertainty
3 Steps to InChI
Chemistry
‘Normalize’ Input Structure
Implement chemical rules
Math
‘Canonicalize’ (label the atoms)
Equivalent atoms get the same label
Format
‘Serialize’ Labeled Structure
Output as character string (‘name’)
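The three steps can be illustrated with a toy sketch. This is not the InChI algorithm: the normalization here only discards bond orders, the canonical numbering is found by brute force over all permutations (workable only for a handful of atoms), and the output format is invented for the example. All function names are hypothetical.

```python
from itertools import permutations


def normalize(atoms, bonds):
    """Chemistry step (toy): keep connectivity only, dropping bond orders."""
    edges = {frozenset((a, b)) for a, b, _order in bonds}
    return list(atoms), edges


def canonicalize(atoms, edges):
    """Math step (toy): try every atom numbering, keep the smallest description."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):  # perm[old_index] = new_index
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        new_edges = sorted(tuple(sorted(perm[i] for i in edge)) for edge in edges)
        candidate = (labels, new_edges)
        if best is None or candidate < best:
            best = candidate
    return best


def serialize(labels, edges):
    """Format step (toy): write the labeled structure as one character string."""
    atoms_part = "".join(labels)
    bonds_part = ",".join(f"{a + 1}-{b + 1}" for a, b in edges)
    return f"{atoms_part}/{bonds_part}"


def toy_identifier(atoms, bonds):
    return serialize(*canonicalize(*normalize(atoms, bonds)))


# Heavy atoms of ethanol, drawn with two different atom numberings.
drawing_1 = toy_identifier(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
drawing_2 = toy_identifier(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
# Dimethyl ether has the same atoms but different connectivity.
ether = toy_identifier(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])
print(drawing_1 == drawing_2, drawing_1 == ether)
```

The point of the sketch is only that a fixed procedure yields the same string regardless of how the structure was drawn, and a different string for a different connectivity.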
Normalize
Simplify
Divide structure into ‘layers’
Each layer ‘refines’ structure
Ignore ‘Electron Density’
Use simple ‘connectivity’ only
Ignore bond type and electron location
Stereochemistry
sp2 and sp3 only
Free rotation around single bonds
No Z/E stereo for small rings (default)
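The layered design is visible in the identifier string itself, where layers are separated by slashes. The sketch below splits an identifier into its layers. It assumes a simple identifier in which each layer after the formula starts with a distinct one-letter prefix (such as c for connectivity and h for hydrogens); real identifiers can be more involved. The example string is the identifier for ethanol in the later "standard InChI" form (version prefix 1S), which postdates this talk; the function name is hypothetical.

```python
def split_layers(identifier):
    """Split a simple InChI string into a dict of layers (illustrative only)."""
    prefix, _, body = identifier.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    parts = body.split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # Assumption: the first character names the layer, the rest is its content.
            layers[part[0]] = part[1:]
    return layers


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
for name, value in split_layers(ethanol).items():
    print(name, value)
```

Each successive layer refines the one before it: the formula alone does not distinguish isomers, the connectivity layer does, and further layers would add detail such as stereochemistry.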
InChI Capabilities
Identify compounds at the known level of detail
Convention-free (mostly)
Generate quickly from structure
Contains all essential connectivity information
Simple ASCII representation
InChI References/Publications
1. Sophie Rovner, "CHEMICAL 'NAMING' METHOD UNVEILED", C&E News, August 22, 2005, Volume 83, Number 34, pp. 39-40
2. International chemical identifier goes online, Chem. World, 16 May 2005
3. M.D. Prasanna, J. Vondrasek, A. Wlodawer and T.N. Bhat, Application of InChI to Curate, Index, and Query 3-D Structures, Proteins: Structure, Function, and Bioinformatics, 2005, 60, 1-4
4. Enhancement of the chemical semantic web through the use of InChI identifiers, S.J. Coles, N.E. Day, P. Murray-Rust, H.S. Rzepa and Y. Zhang, Org. Biomol. Chem., 2005, 3(10), 1832-1834
5. InChI FAQ, by Nick Day (Unilever Centre for Molecular Informatics, Cambridge University)
6. Representation and Use of Chemistry in the Global Electronic Age, P. Murray-Rust, H.S. Rzepa, S.M. Tyrrell and Y. Zhang, Org. Biomol. Chem., 2004, 3192-3203 [www.ch.ic.ac.uk/rzepa/obc/]
7. That InChI feeling, Reactive Reports, issue 40, Sep 2004
8. Unique labels for compounds, Chem. & Eng. News, 2 Dec 2002
9. Chemists synthesize a single naming system, Nature, 23 May 2002
10. That IChI feeling ... The Alchemist, 24 Apr 2002
11. What's in a Name? The Alchemist, 21 Mar 2002
12. Stephen E. Stein, Stephen R. Heller, and Dmitrii Tchekhovskoi, An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier, Proceedings of the 2003 International Chemical Information Conference (Nimes), Infonortics, pp. 131-143.
Early InChI Adopters
NIST – 150,000
NIH/NCBI/PubChem project – 5.2 million+
IBM – 1.6+ million
ISI – 2+ million
NCI Database – 23 million+
EPA DSSTox database – 1450
KEGG database – 9584
UCSF ZINC – 3.3 million
SciTegic use of InChI: A compound acquisition protocol in which InChIs are used to check for duplicates between the vendor library and the corporate collection. The InChIs are calculated for both input streams, and the merge data component is then used to compare the InChIs and collapse all the molecules with the same InChI onto a single record. It has found about 2000 molecules in Maybridge that were already in our "corporate collection".
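The duplicate check described above amounts to grouping records by their identifier string. The following is a minimal sketch of that idea in plain Python, not the SciTegic protocol or its API: the record layout, the source IDs, and the function name are all made up for illustration, and the identifiers (methane, ethanol, water) are assumed to have been computed beforehand.

```python
from collections import defaultdict


def merge_by_identifier(*streams):
    """Collapse records from several input streams onto one entry per InChI."""
    merged = defaultdict(list)
    for stream in streams:
        for record in stream:
            merged[record["inchi"]].append(record["source_id"])
    return dict(merged)


# Hypothetical records; only the identifier strings are real.
vendor_library = [
    {"source_id": "V-001", "inchi": "InChI=1S/CH4/h1H4"},
    {"source_id": "V-002", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
]
corporate_collection = [
    {"source_id": "C-117", "inchi": "InChI=1S/CH4/h1H4"},
    {"source_id": "C-118", "inchi": "InChI=1S/H2O/h1H2"},
]

merged = merge_by_identifier(vendor_library, corporate_collection)
# Any identifier that collected more than one record is a duplicate.
duplicates = {inchi: ids for inchi, ids in merged.items() if len(ids) > 1}
print(duplicates)
```

Because the identifier is a plain string, the comparison needs no structure-search software: an ordinary dictionary or database key is enough.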
Future
Future versions of InChI, for example, could include phase information and crystal structure, conformations, electronic states and additional classes of stereochemistry.
First additional project: Investigate adding polymers to InChI
Acknowledgements
I really think my friends would prefer if I left their names off this slide.
Acknowledgements
Steve Bachrach, Steve Bryant, Denise Creech, Nick Day, Rene Deplanque, Guenter Grethe, Stevan Harnad, Sami Kassab, Gary Mallard, Randy Marcinko, Alan McNaught, Bill Milne, Carmen Nitsche, Chris Reed, Rich Roberts, Peter Murray-Rust, Henry Rzepa, Steve Stein, Peter Shepherd, Bill Town, Andrea Twiss-Brooks, Wendy Warr, and Ann Wolpert