Steven M. Bachrach
Department of Chemistry
1 Trinity Place
San Antonio, TX 78212
This presentation surveys the area of electronic manuscript publishing, with an emphasis on how XML, CML, and the INChI activities are having, and will continue to have, a major impact on scientific publishing.
pdf, e-publishing, Open Access, PubChem, SPARC, XML, Chemical Markup Language, CML, INChI, IUPAC, IUPAC-NIST Chemical Identifier, Multimedia
Six years ago one of us (SRH) wrote: “To write a chapter about a topic which is so new and developing so rapidly that changes take place just about everyday is an interesting challenge. What I hope to accomplish in these few pages is to explain what electronic publishing is and explore a number of issues associated with this new area of information dissemination. Yes!, this is a new area of dissemination! And perhaps this is the place to start - by defining electronic publishing. Electronic publishing is a new form of communication. Electronic publishing, for the purposes of scholarly scientific presentation of results, is the creation of a scholarly work that is in a totally electronic (non-paper) form from its creation to its publication or dissemination. An electronic journal is a product that was specifically developed and designed for the Internet, a product that is not re-worked printed material that is delivered electronically. As I hope to show in this chapter, electronic journals and electronic publishing is much more than an alternative to print.” (1)
It is sad to report that electronic publishing has yet to become a new form of publishing for scholars. To a considerable extent the publishing industry has only re-worked what was in print and converted it into an electronic form, often preserving the look-and-feel of hardcopy print as well.
Current publication practices
In recent years there has been a mass migration to electronic print. The American Chemical Society (ACS) (2), Royal Society of Chemistry (3), Elsevier (4), John Wiley (5), Springer (6), Taylor & Francis (7) and virtually all other publishers now provide their journals in electronic form. Virtually all publishers provide pdf (8) versions. While some provide just pdf, others offer an html version in addition to the pdf.
Many publishers have created a common interface to all of their journals. One example, Elsevier Science Direct (9), launched in 1999, currently contains over 5 million articles. Science Direct is said to be the world's largest scientific, technical and medical (STM) database. While it is a large collection of journal articles, supplemented by relevant bibliographic databases, it, like its competitors, contains virtually nothing more than what appears in hardcopy print.
While some publishers do provide supplemental materials (such as video files, audio files, Excel spreadsheets and Word files), these are effectively add-ons and not an integral part of the article. A minor exception is the web enhancements that ACS publishes with some articles. These enhancements have typically been interactive molecules, but the vast majority of articles do not include any enhancements whatsoever. Put plainly, most authors are not able to “think out of the box” when it comes to presenting their research results or when trying to describe something other than in simple text. Furthermore, publishers and reviewers have not been actively encouraging their authors to include non-textual materials within their manuscripts. Examples of non-textual materials relevant to the chemical sciences include interactive molecular structures and spectra that can be manipulated by the reader; interactive 2-D and 3-D plots where the reader can directly manipulate the data, refit lines, and incorporate their own data points; and animations from molecular dynamics studies that allow the reader to freeze-frame, proceed through in slow-motion, etc.
The multimedia power of the Internet has yet to touch the major journal publishers. In the past decade there have been a number of new journals started in the area of multimedia (10 a-e), but few (10f) do more than just talk in principle about multimedia. Those journals which do attempt to create a new form of publishing, such as the Internet Journal of Chemistry (IJC) (11) (originated by one of the authors (SMB)) and the Interactive Multimedia Electronic Journal of Computer-Enhanced Learning (10f), have had less than enthusiastic success. The IJC has attracted relatively few papers, and one reason is that many authors have yet to realize and take advantage of the power of true Internet publishing. Over the past few years the lack of progress in this area has been shown to be not a technical matter, but rather a social and political issue. As Max Planck pointed out some 70 years ago in his autobiography (12):
"New scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it."
It will be primarily the generation that has grown up on video games, Napster, Instant Messenger and the like who will become the authors who will take advantage of this new technology.
While the actual content of journals has not improved by their being put into electronic form, the mechanism for dissemination is a new paradigm. The number of people who physically go to their libraries to read or look up a journal article has dropped off dramatically. The Internet has changed the habits of most scholars, who now work from an office or from home. As some two-thirds of the cost of a library (13) is for the building and staff, this change in usage practice may provide for some economic improvement. However, the increased cost of journals in the past few years (14) more than negates these savings. Libraries have responded principally by canceling some journal subscriptions, but also by budget tightening in other areas.
Rather than dwelling on the negatives of the current publication system – high journal subscription fees, rates of price increase that well exceed the rise in the cost of living, tightening library budgets, cancelled journal subscriptions – we sense a real opportunity for scholars to find new ways to communicate, principally by exploiting Internet technology. A few recent initiatives are discussed below.
While pre-prints of scholarly works have been available, formally (such as abstracts from ACS meetings) and informally (such as high-energy physics pre-prints), for decades, the Internet provides the means by which such pre-prints and scholarly works can be made available more readily and at less expense. Current use of pre-print servers has had mixed success, based largely on cultural and societal mores. Such systems as the arXiv (15), the ChemWeb chemistry preprint server (16), and openArchive (17) have started a trend that may augur a total revolution in the dissemination of scholarly publishing.
The high price of journals led the Association of Research Libraries (ARL) to create the Scholarly Publishing and Academic Resources Coalition (SPARC) project (18), which is designed specifically to target the highest priced commercial journals in order to inflict the greatest pain on their profits. SPARC principally sponsors journals created by societies and independent scientist groups that offer more cost-effective means of distributing information. A small number of the SPARC sponsored journals have now established themselves as important resources. Some anecdotal evidence does suggest that commercial publishers are beginning to contain cost increases in part as a response to the SPARC initiative.
The PubScience project (19), a publicly available web-based tool to access articles published in peer-reviewed journals, was sponsored by the US Government Department of Energy. It was designed to encourage access to articles without having to wade through multiple websites, publications and references. Unfortunately, this project was discontinued in late 2002 as a direct result of pressure from commercial and non-commercial STM organizations.
While a number of aspects of electronic publishing have been disappointing over the past few years, there have also been some new and interesting projects that show signs of progress. One such project is the COUNTER project (20). In the past, it was hard, if not impossible, to tell whether a journal or a particular article in a journal was ever read by anyone save the author and his/her mother. Electronic publishing allows the publisher to count the number of times an article is accessed. These web statistics are, however, vulnerable to misinterpretation and misuse. The COUNTER project intends to establish standards and best practices for gathering and disseminating usage statistics. This information will be of enormous use to readers and librarians in identifying effective journals and in making difficult subscription decisions. Further, it will allow publishers to assess the effectiveness of their own products, especially in relation to those of their competitors.
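To illustrate why raw access counts can mislead, the following minimal sketch counts article accesses while collapsing rapid repeat requests from the same reader into a single access. The event fields, the function name and the 10-second window are our own assumptions for illustration; they are not taken from the COUNTER documents.

```python
from collections import Counter

def count_accesses(events, window_seconds=10):
    """Count accesses per article, treating repeat requests by the same
    reader for the same article within `window_seconds` as one access.

    `events` is an iterable of (timestamp_seconds, reader_id, article_id).
    The window length is an illustrative assumption, not a quoted rule.
    """
    last_seen = {}      # (reader, article) -> time of the most recent request
    counts = Counter()  # article -> number of counted accesses
    for timestamp, reader, article in sorted(events):
        key = (reader, article)
        previous = last_seen.get(key)
        # Count only if this is the first request or the gap is long enough.
        if previous is None or timestamp - previous > window_seconds:
            counts[article] += 1
        last_seen[key] = timestamp
    return counts

# Hypothetical log: reader u1 double-clicks article A at t=0 and t=3.
events = [
    (0, "u1", "A"),
    (3, "u1", "A"),
    (4, "u2", "A"),
    (5, "u1", "B"),
    (20, "u1", "A"),
]
filtered = count_accesses(events)
raw = Counter(article for _, _, article in events)
```

In this hypothetical log the raw tally for article A is one higher than the filtered tally, which is the kind of discrepancy that an agreed counting practice is meant to remove.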
One major change in the reading habits of most scholars is that they no longer need to walk to their library to read or copy an article. While little has been done to enhance the journal article itself, the citations at the end of an article have been made much more useful by such projects as CrossRef (21), SFX (22), ChemPort (23), and LitLink (24). With these technologies, the citations become active links to the cited articles. The reader can then access the cited articles by clicking through and reading on the screen. Thus, there is no need to access the physical library; the virtual library is sufficient.
In the late 1970s the first computer system that linked chemical structures and names with chemical information and data was established at the NIH (25). The system evolved into the NIH/EPA Chemical Information System (CIS). While the system, like PubScience, was discontinued by the US Government, the concept did not die. The NIH/NLM/NCBI Entrez system (26), MDL DiscoveryGate (27), and projects such as the Elsevier Dymond project (28) are examples of how chemical structures, data, and the literature can be seamlessly linked. Lastly, the NIH/NLM/NCBI has initiated a project, called PubChem (29), which is projected to be a system of chemical structures and biological data resources, available and searchable in a manner similar to their Entrez system.
Interoperability and Data Standards
Chemical Markup Language (CML)
Perhaps the key element toward creating a truly revolutionary publication system is the ability of computers and humans to readily and seamlessly share and reuse information. A major impediment has been the lack of commonality of file formats used amongst programs and users. For example, there are well over 50 partially and fully defined formats to describe molecular structures; common ones are SMILES, MDL molfile, Sybyl mol file, GAUSSIAN input, etc. Some of these formats, such as SMILES, are not fully defined, since they have changed over time and/or are open to different interpretations of the incomplete structure definition rules (30).
An outgrowth of the popularity of the web and its HTML format is the recognition of the ubiquitous need for common interoperable standard file formats for all disciplines. The Extensible Markup Language (XML) is a set of rules for creating formats for disparate disciplines, each capable of defining markup specific to that particular discipline, yet structured so that multiple different users and programs can process the information. Chemists were amongst the first adopters of XML, recognizing the power of a structured file format that would allow for the widespread reuse of chemical information among a broad cross-section of the discipline. Such a format would allow users to exchange the full 3-D structure of a molecule and its NMR, IR and mass spectra in a lossless manner, and would further enable each end-user to direct this same file into a myriad of different software applications specific to their individual needs and purposes.
The first proposed XML for chemistry was termed CML, Chemical Markup Language (31). The originators of CML, Peter Murray-Rust and Henry Rzepa, note on their CML web site: “The origins of domain specific scientific (i.e. non-bibliographic) markup languages can be traced at least as far as the first World-Wide Web conference (WWW1) held at CERN in May 1994, when a session on the future of HTML developed into a discussion of how Mathematics and Chemistry might be expressed”. Over the following nine years, CML was defined by a DTD and then recast as an XML schema. Prototype CML browsers, along with applications that read and write CML, have been developed. In addition, IUPAC has an initiative to develop a markup language that will enable reuse of its standards series of books (32).
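As a minimal sketch of the reuse that a structured format allows, the fragment below describes water in a CML-style layout and is read with a generic XML parser from the Python standard library. The element and attribute names follow common CML usage as we understand it, but the fragment omits the CML namespace and has not been validated against the CML schema; the coordinates and the helper function names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative CML-style fragment (no namespace, not schema-validated).
CML_WATER = """
<molecule id="water">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""

def read_molecule(cml_text):
    """Return (atoms, bonds) parsed from a CML-style string.

    atoms maps atom id -> (element, x, y, z);
    bonds is a list of (atom_id_1, atom_id_2, order) tuples.
    """
    root = ET.fromstring(cml_text)
    atoms = {}
    for atom in root.iter("atom"):
        coords = tuple(float(atom.get(axis)) for axis in ("x3", "y3", "z3"))
        atoms[atom.get("id")] = (atom.get("elementType"),) + coords
    bonds = []
    for bond in root.iter("bond"):
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, bond.get("order")))
    return atoms, bonds

def formula(atoms):
    """Element counts as a string, elements in alphabetical order."""
    counts = Counter(entry[0] for entry in atoms.values())
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(counts.items()))

atoms, bonds = read_molecule(CML_WATER)
print(formula(atoms))  # prints "H2O"
```

The same file could equally be passed to a viewer, a search index or a computational package, since each program needs only an XML parser and knowledge of the element names.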
IUPAC-NIST Chemical Identifier (INChI)
In March 2000 IUPAC convened a meeting in Washington DC to look into the matter of chemical structure representation (33). The IUPAC Strategy Roundtable meeting was called “Representations of Molecular Structure: Nomenclature and its Alternatives”. It brought together 41 participants from 10 countries including experts in organic, inorganic, biochemical, and macromolecular nomenclature; users of nomenclature in academia, industry, the patent, international trade, health and safety communities; journal editors and publishers; database providers; and software vendors.
Over the past decade, with the ever-increasing reliance on computer processing by chemists, it became evident to many within IUPAC that the organization should find better ways of handling nomenclature than had been used in the past. In particular it was felt that while IUPAC had stressed conventional chemical names/nomenclature in the 20th century, continued progress into the 21st century required new, computer-driven approaches to the problem of chemical identification.
At the March 2000 meeting a proposal was presented to IUPAC, which extended one developed by one of the authors (SRH) in the fall of 1999. The initial proposal from November 1999 was widely circulated within the chemical information and chemical structure representation community via e-mail. The proposal presented at the March 2000 meeting incorporated considerable improvements based on this feedback from chemists in the USA, Europe, and Asia.
At the end of the March 2000 meeting Bill Town (33) proposed that the new program be called IUPAC Chemical Identifier Project (IChIP). The name was changed in the fall of 2003 to INChI to reflect the considerable and effective efforts and support of the NIST staff who have developed virtually the entire project.
The aim of the IUPAC-NIST Chemical Identifier Project (INChIP) is to establish a unique label, the IUPAC-NIST Chemical Identifier (INChI), which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources. INChI will enable easier linking of diverse data compilations and provide unambiguous identification of chemical substances (34). A number of short articles have been written describing the INChI project (35-37).
INChI is not a registry system. It does not depend on the existence of a database of unique substance records to establish the next available sequence number for any new chemical substance being assigned an INChI. It will be based on a set of IUPAC structure conventions, and rules for normalization and canonicalization (38) of an input structure representation to establish the unique label. It will thus enable an automatic conversion of a graphical representation of a chemical substance into the unique INChI label, which can be created independently of any organization anywhere in the world. INChI could then be built into any chemical structure drawing program and created from any existing collection of chemical structures. To date only the ACDLabs ChemSketch freeware (39) drawing program has an automatic interface to INChI.
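The idea of canonicalization can be illustrated with a toy sketch: a label is computed from the structure alone, so that the same molecule drawn with its atoms in a different order yields the same label, and no central registry is consulted. The code below is emphatically not the INChI algorithm; it is a brute-force stand-in of our own that tries every atom ordering and keeps the smallest resulting description, and it is practical only for very small, hydrogen-suppressed graphs. The function name and label syntax are hypothetical.

```python
from itertools import permutations

def canonical_label(elements, bonds):
    """Toy canonical label for a small molecular graph.

    elements: list of element symbols, one per atom.
    bonds: list of (i, j) index pairs into `elements`.
    Every atom ordering is tried (factorial cost) and the
    lexicographically smallest description is returned.
    """
    best = None
    for perm in permutations(range(len(elements))):
        # perm[new_position] is the old index placed at that position.
        new_index = {old: new for new, old in enumerate(perm)}
        relabeled_elements = tuple(elements[old] for old in perm)
        relabeled_bonds = tuple(sorted(
            tuple(sorted((new_index[a], new_index[b]))) for a, b in bonds
        ))
        candidate = (relabeled_elements, relabeled_bonds)
        if best is None or candidate < best:
            best = candidate
    atoms_part, bonds_part = best
    return "".join(atoms_part) + "/" + ",".join(f"{a}-{b}" for a, b in bonds_part)

# Ethanol (heavy atoms only) entered in two different atom orders.
ethanol_a = canonical_label(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_label(["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether has the same atoms but different connectivity.
ether = canonical_label(["C", "O", "C"], [(0, 1), (1, 2)])
```

Here the two ethanol inputs give an identical label while dimethyl ether gives a different one, which is the property that lets independently generated labels serve as links between data collections.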
In early 2004 the first official release of the INChI algorithm was delivered to the chemistry community. Time will tell if this proposed open source standard will in fact become a standard.
In closing, the authors wish to present their view of what the future of electronic publishing will look like. When this new model or system will be implemented cannot be predicted with any degree of accuracy; however, we believe its contents and characteristics can be reasonably well predicted.
The journals of the future will be one large distributed database, maintained mostly by libraries. While likely owned by many players, the separate journals will be presented in an XML-like manner, with multiple interconnects between journal articles and ancillary databases. Most of these journals will be Open Access (40). For chemistry, the INChI will serve as the universal chemical structure link between these databases. Google-type search engines will evolve into text mining engines capable of searching all of these journal databases, numeric and related informational databases as well as linking to various biological databases of genomic, proteomic, and metabolomic information. Chemical structure searches will be performed using plug-ins from chemistry software vendors. Chemically-aware search queries will take the user across and through these databases. Multidirectional links make for a highly interconnected network of chemical information.
The winners in this view of the future will be the authors, researchers and libraries. We hope that this multi-directional interconnected information network can be delivered to those with reasonable budgets. As usual, the losers will be the owners of the current technologies, particularly those who cannot adapt to new and changing market conditions.
The authors would like to acknowledge the long term comments and feedback from a number of our colleagues, especially Alan McNaught, Bill Milne, Peter Murray-Rust, Carmen Nitsche, Henry Rzepa, Peter Shepherd, Steve Stein, Bill Town, and Wendy Warr.
1. S. R. Heller, Electronic Publishing of Scientific Manuscripts, in The Encyclopedia of Computational Chemistry Schleyer, P. v. R.; Allinger, N. L., Clark, T.; Gasteiger, J.; Kollman, P. A.; Schaefer III, H. F.; Schreiner, P. R. (Eds.); John Wiley & Sons: Chichester, 1998, pp 871-875. Available at: http://www.hellers.com/steve/resume/p146.html
10. a) IEEE Transactions on Multimedia (started in 1999) b) Multimedia Tools and Applications (started in 1995) c) IEEE Multimedia Magazine (started in 1994) d) ACM Journal of Multimedia Systems (started in 1993) e) Journal of Visual Languages and Computing (started in 1990) f) Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, http://imej.wfu.edu
12. M. Planck, "Scientific Autobiography and Other Papers", Williams & Norgate, London (1950), pages 33-34.
13. "However, for each $1 spent on journal acquisitions, other library costs come to $2." Andrew Odlyzko: "The economics of electronic journals". Available at http://www.dtc.umn.edu/~odlyzko/doc/economics.journals.txt
14. a) http://www-cs-faculty.stanford.edu/~knuth/joalet.pdf b) http://www.library.cornell.edu/scholarlycomm/elsevier.html
25. a) S. R. Heller, G. W. A. Milne, and R. J. Feldmann, A Computer Based Chemical Information System, Science, 195, 253-259 (1977). b) S. R. Heller, The NIH/EPA Chemical Information System (CIS) Physical and Chemical Databases, Drexel Library Quarterly, 18, No. 3 & 4, pages 39-66 (1982).
30. http://www.daylight.com/ and D. Weininger, "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules", J. Chem. Inf. Comp. Sci., 28 (1), 31-36 (1988).
31. CML: http://www.xml-cml.org/
35. Unique labels for compounds C&EN, 26 Nov 2002 http://pubs.acs.org/cen/today/nov26.html
36. That ICHI feeling ... The Alchemist, 24 Apr 2002 http://www.chemweb.com/alchem/articles/1015947904091.html
37. What's in a Name? The Alchemist, 21 Mar 2002 http://www.chemweb.com/alchem/articles/1015947151360.html
38. Stephen E. Stein, Stephen R. Heller, and Dmitrii Tchekhovski. An Open Standard for Chemical Structure Representation - The IUPAC Chemical Identifier, 2003 Nimes International Chemical Information Conference Proceedings, pages 131-143 (2003). Also see slides of INChI presentations: http://www.hellers.com/steve/pub-talks/
39. Advanced Chemistry Development, http://www.acdlabs.com
40. http://www.earlham.edu/~peters/fos/guide.htm and http://www.earlham.edu/~peters/fos/fosblog.html