Chemical substructure search software
for personal computers
Stephen R. Heller and Daniel E. Meyer
This is the third in a series of articles aiming to promote a higher awareness of the computer applications in the
management, dissemination and uses of chemical data. It describes a number of computer programs which
perform chemical substructure searching on personal computers.
Introduction
Part of the educational activities of the IUPAC Committee on Chemical Databases (CCDB) is to inform
IUPAC members of interesting and significant computer activities. This relates to the CCDB term of
reference to promote a higher awareness of the computer applications in the management, dissemination
and uses of chemical data.
The first article described online computer services (1a). The second was concerned with the Beilstein and
Gmelin databases (1b). This third article describes a number of computer programs which perform
chemical substructure searching on personal computers (PCs).
Chemical substructure searching (SSS) is the ability to search a structure database using any structure
fragment. (Using the entire or full structure is usually called an identity search or exact-match search.) The
results of an SSS are the chemical structures in the database which contain the query structure fragment. An
example of such a search, using HTSS software (see below), is shown in Figs 1-5.
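The idea of a substructure search can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm used by any of the packages described here: it treats a structure as a list of atom symbols plus a list of bonded atom pairs, ignores hydrogens and bond orders, and uses simple backtracking. The names Molecule, contains_fragment and substructure_search, and the three-compound database, are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical minimal representation: (atom symbols, bonded index pairs).
# Hydrogens and bond orders are ignored to keep the sketch short.
Molecule = Tuple[List[str], List[Tuple[int, int]]]


def neighbours(mol: Molecule) -> Dict[int, set]:
    """Build an adjacency map from the bond list."""
    atoms, bonds = mol
    adj = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def contains_fragment(target: Molecule, query: Molecule) -> bool:
    """Return True if the query fragment can be mapped onto the target."""
    t_atoms, q_atoms = target[0], query[0]
    t_adj, q_adj = neighbours(target), neighbours(query)

    def extend(mapping: Dict[int, int]) -> bool:
        if len(mapping) == len(q_atoms):
            return True  # every query atom has been placed
        q = len(mapping)  # next query atom to place
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Each already-placed neighbour of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]  # backtrack
        return False

    return extend({})


def substructure_search(database: Dict[str, Molecule], query: Molecule) -> List[str]:
    """Names of all database structures that contain the query fragment."""
    return [name for name, mol in database.items() if contains_fragment(mol, query)]


# Illustrative database (heavy atoms only).
DATABASE: Dict[str, Molecule] = {
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "propane": (["C", "C", "C"], [(0, 1), (1, 2)]),
    "acetic acid": (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)]),
}
C_O_FRAGMENT: Molecule = (["C", "O"], [(0, 1)])

hits = substructure_search(DATABASE, C_O_FRAGMENT)
```

Here the C-O fragment retrieves ethanol and acetic acid but not propane. An identity search would instead require the query and the target to have the same atoms and bonds throughout.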
Over the years there have been many ways to represent chemicals, from nomenclature to chemical
notations to structure diagrams (which are actually mathematical graphs). There are problems with some
forms of representation, such as nomenclature, as noted in the first article (1a). Linear notations, such as the
Wiswesser Line Notation (WLN) (2) as well as other notations such as SMILES (3), all require learning a new
'language'.
With more powerful computers and inexpensive disk space becoming available in the past few years,
graphic input, storage, and searching of structures using PCs has become fairly commonplace. The best
part of drawing a structure or fragment is that it is the universal chemical language and thus readily
understood by all chemists. In this article we discuss PC computer systems for searching which use
structure diagrams for input (query) and output (answers).
Today, it is possible to search the Chemical Abstracts structure file of over nine million compounds online
using the STN or DARC systems (4). These systems are available via telephone access from anywhere in
the world for a cost of about USD 250 per hour.
However, many chemists do not need regular access to such a large database. In most cases a database
of the chemicals in one's own laboratory or those under current study, or the chemicals from a commercial
catalogue, would satisfy day to day needs. Having one's own computerized file card system of chemicals is
a goal which can easily be attained.
With the advent of the personal computer, a number of companies have developed, and are making
available for a nominal cost, software which allows individual researchers to create, maintain, and add to a
personal file of chemical structures (5,6). It is also possible to buy some sample databases and use these
in teaching courses in universities and industry.
Hardware capabilities and costs
At present the only SSS software for PCs uses the IBM type MS-DOS-based operating systems. No
software for the Apple Macintosh is yet available, although the producers of two such software packages,
HTSS and ChemData (7), have indicated that such versions will be available in 1990.
The specific hardware requirements vary from program to program, with some programs having such
limited requirements that almost all PCs are able to run them without the need to buy additional equipment.
To use some other programs to their full capability, in particular ChemBase, requires computer hardware
that can easily cost about USD 6000-7000 as one needs a high resolution colour system with a fully
equipped laser printer system. Put another way, 'ChemText's hardware appetite is quite hefty' (8). However,
it should be noted that ChemText can run on hardware costing as little as a total of USD 2000.
To run most of the software described in this article the user needs an IBM PC or equivalent with 512K of
main memory, a graphics capability (usually a Hercules graphics board, a Colour Graphics Adapter (CGA) or
Enhanced Graphics Adapter (EGA) or equivalent), a mouse for structure input, and preferably a hard disk. Such
a system should easily be available for about USD 1500 in the USA, and slightly higher in other countries. At
this price, it is felt that such equipment is well within the means of individual chemists.
Available PC SSS software package capabilities
This article describes five SSS PC programs. They are, in alphabetical order:
ChemBase
ChemFile II
ChemSmart
HTSS (TREE)
PSIDOM
One of these packages (HTSS) also runs on larger computers, such as the VAX or IBM mainframes.
As such machines are beyond what is normally available in a laboratory or office, these
versions and their features will not be discussed.
Structure search program descriptions
These five packages all allow the chemist for the first time to have this powerful tool available in the laboratory
or office. For university chemists, either for their own research activities or for use in teaching courses, the
ChemFile II or ChemSmart programs are useful for learning SSS techniques and methods.
For those with more advanced needs, the HTSS and PSIDOM systems are likely to be the next step up. The best supported, but also the most expensive, of the five packages is ChemBase. In any case, what
is of most importance is for chemists to begin to make use of this type of computer software, as it is likely to
gain acceptance throughout the chemical community.
CHEMBASE
ChemBase is the chemical substructure search software from Pergamon-Molecular Design Ltd (MDL) (8-12). MDL has sold more in-house systems than any other company in the field, and lately has expanded
into the PC market.
ChemBase is a well-polished software package, with excellent documentation and a very good user
interface. ChemBase handles data and text information as well as structures. ChemBase also interfaces
directly to the MDL text processor, ChemText. It requires an IBM PC, with 640K of RAM, a hard disk, a
mouse, and preferably an EGA card (although a CGA or Hercules graphics adapter will work), and a colour
monitor (although it will run quite adequately on a monochrome monitor).
A fully equipped laser printer brings out all of the superior features of the program, but dot matrix output is
acceptable. Its search speed and the size of files it can handle are both lower than those of some of the other
programs, such as HTSS and PSIDOM. ChemBase also allows users to design their own output format.
CHEMFILE II
ChemFile II is the chemical substructure search program from COMPress (12,13), written by John Figueras,
a retired chemist from Eastman Kodak. ChemFile II is a very affordable floppy disk-based package, designed
for very small files and for people on a budget, who have a PC with 256K of memory and 5 1/4 inch floppy
disks.
If you have more than one disk of data to search, the program allows you to continue the search or use
another disk of data. Each diskette can hold data for about 250 compounds, which makes a complete
search of a large database (of several thousand compounds) slow and requires the user to sit at the
terminal and change disks until the search is completed.
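The disk-swapping burden follows directly from the capacity quoted above. As a rough sketch, assuming the figure of about 250 compounds per diskette holds for every disk (the function name and the example file sizes are hypothetical):

```python
import math

COMPOUNDS_PER_DISKETTE = 250  # approximate capacity quoted in the text


def diskettes_needed(n_compounds: int, per_disk: int = COMPOUNDS_PER_DISKETTE) -> int:
    """Number of diskettes that must be searched in turn for a file of this size."""
    return math.ceil(n_compounds / per_disk)


# Example: an illustrative file of 3000 compounds.
disks_for_3000 = diskettes_needed(3000)
```

On this assumption a file of 3000 compounds is spread over 12 diskettes, each of which has to be inserted by hand during a complete search.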
Text and numerical information as well as structures can be entered. ChemFile II supports up to 20 user-defined data fields, but the size of each field is limited. A good feature of this program is the ability to search
for ranges of numerical data, such as boiling point ranges. The documentation and user interface are both
good.
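A range search of this kind amounts to a simple filter over a numerical field. The following Python sketch is illustrative only and does not reflect how ChemFile II stores or searches its data; the record layout, the field name boiling_point_c and the function range_search are assumptions, and the boiling points are approximate values.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, float]]

# Illustrative records; boiling points in degrees Celsius (approximate).
RECORDS: List[Record] = [
    {"name": "ethanol", "boiling_point_c": 78.4},
    {"name": "water", "boiling_point_c": 100.0},
    {"name": "acetone", "boiling_point_c": 56.1},
    {"name": "unknown sample"},  # no boiling point entered
]


def range_search(records: List[Record], field: str, low: float, high: float) -> List[str]:
    """Names of records whose numerical field lies within [low, high]."""
    return [
        str(r["name"])
        for r in records
        if field in r and low <= float(r[field]) <= high
    ]


volatile = range_search(RECORDS, "boiling_point_c", 50.0, 80.0)
```

Records with no value in the field are simply skipped, so an incomplete file can still be searched.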
CHEMSMART
ChemSmart is the chemical substructure search program being marketed by ISI (14-16), and written by
Scott Gould (11-13). ChemSmart is another affordable package for people on a budget with just a PC and 5
1/4 inch floppy disks.
The program supports some numerical and text information searching as well as structure searching.
However, while a good deal of data can be entered, stored, and displayed, only the molecular formula and
the compound names are searchable.
ChemSmart comes with a sample database of 250 compounds, and additional specialized databases from
the main ISI Index Chemicus database are available for purchase. These databases all have Index
Chemicus numbers, but no CAS Registry number. For teaching purposes, it would seem this product is
very useful. The documentation and user interface are quite good, but structure entry is slow.
HTSS
HTSS is the chemical substructure search program from Hierarchic Tree Substructure Search Systems
(17,18) and was developed in Hungary by Peter Bruck and colleagues. It is to be used by CCDB for IUPAC
computerized structure search products (see Box).
The official release of the PC version took place early in 1988, so it is too soon to comment on updates and
improvements. Versions of the HTSS software are also available which run on the VAX family of computers
and IBM mainframe computers.
HTSS is marketed in the USA under the names HTSS and TREE. It does not support data and text, but
rather interfaces with PC database management programs, such as dBASE III Plus. The program allows for
Markush structure searching, so long as you have such a database of Markush structures (19). It also
interfaces with virtually all word processors by allowing the user to add a structure at any place within a
word-processing program, such as WordStar or WordPerfect.
HTSS is by far the fastest of all the chemical structure software packages available, and can easily handle
large files of 25 000 structures or more. The documentation is its weakest point, but is improving. The
tutorial demonstration disk which comes with the system and the user interface are excellent.
PSIDOM
PSIDOM (Professional Structure Image Database On Microcomputer) is part of a family of PC software for
chemists from Hampden Data Services (20). PSIGEN is the basic software which allows an IBM PC
computer to create connection tables for chemical representation and then display these chemicals as
structures on the screen.
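A connection table is essentially a list of atoms together with a list of the bonds between them. The Python sketch below shows one generic way to hold such a table; it is not the PSIGEN file format, and the class ConnectionTable and its methods are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ConnectionTable:
    """Generic connection table: atoms plus bonds, hydrogens omitted."""
    atoms: List[str]                   # element symbol for each atom
    bonds: List[Tuple[int, int, int]]  # (atom index, atom index, bond order)

    def degree(self, index: int) -> int:
        """Number of bonds attached to the atom at this index."""
        return sum(1 for a, b, _ in self.bonds if index in (a, b))

    def heavy_atom_counts(self) -> Counter:
        """Count of each element among the stored atoms."""
        return Counter(self.atoms)


# Acetic acid, heavy atoms only: CH3-C(=O)-OH
acetic_acid = ConnectionTable(
    atoms=["C", "C", "O", "O"],
    bonds=[(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)
carboxyl_carbon_degree = acetic_acid.degree(1)
```

A drawing program can build such a table as the chemist sketches a structure, and a display routine needs only atom coordinates in addition to redraw it.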
PsiBase (which includes the PSIGEN drawing capabilities) is the structure search module which takes the
PSIGEN connection tables and actually performs the substructure searching. PsiBase also allows the user
to store and search data together with the chemical structures.
At present the entire software system is being bundled together with the Derwent Standard Drug File which
contains over 16 000 chemical structures. The system has mouse-based structure input as well as
keyboard input. The documentation is good and the program is easy to learn. The ability to interface with
and easily use other modules developed by the company, such as PsiCard, PsiView, PsiPlot, and PsiText,
is a positive feature of this system.
The Hampden Data Services software is also the basis of the recently released STN Express software
(21,22) which allows offline structure query formulation before going online to search the CAS Online
structure database on STN International. PsiBase and STN Express are totally compatible and queries may
be exchanged between the two systems.
Summary
This article has described a number of commercially available computer programs which allow chemists
from all IUPAC countries throughout the world to create structure databases on a PC and to search them
by chemical structure fragment.
The ability to perform substructure searching is a valuable new tool for chemists in universities, industry
and government. Now that such software has become relatively inexpensive and easy to install and use on
personal computers, which chemists already own or have access to, this technique should become more
popular throughout the chemical community.
References
1. (a) Heller, S. R. Chem. Int. Vol. 9, pp. 136-138,
1987. (b) Heller, S. R. Chem. Int. Vol. 11, pp. 49-52, 1989.
2. Smith, E. G. and Baker, P. A. The Wiswesser
line-formula chemical notation (WLN), 3rd
Edn. Chemical Information Management,
Cherry Hill, NJ, USA. 1976.
3. MedChem Software Manual, Release 3.32,
Medicinal Chemistry Project, Seaver Chemistry
Lab., Pomona College, Claremont, CA 91711,
USA.
4. Directory of Online Databases. Cuadra/Elsevier,
Vol. 9, January, see p. 106 and 323, 1988.
5. Meyer, D. E. Microcomputer-based software for
chemical structure management: a comparison.
ACS Symposium Series No. 341, ACS,
Washington DC 20036, USA, pp. 29-36, 1987.
6. Warr, D. Introduction to graphics for chemical
structures. ACS Symposium Series No. 34, 1987.
7. ChemData will be available from VCH Publishers, 220 East 23rd Street, Suite 909, New York, NY 10010, USA. Tel: +1 (212)
683 8333.
8. Butler, L. C. Scientific Word Processor Integrates Tricky Symbols. The Scientist, July 11, p. 22, 1988.
9. Molecular Design Ltd., 2132 Farallon Drive, San Leandro, CA 94577, USA. Tel: +1 (415) 895 1313.
10. Meyer, D. and Cohan, P. Designing new compounds with a PC database. Am. Biotech. Lab. Vol. 5, No.
1, pp. 32-39, 1987.
11. Seiter, C. Your PC may solve your chem lab problems. R&D, Vol. 29, No. 3, pp. 94-96, 1987.
12. Curry-Koenig, B. PC Chemical databases--new tools for chemists. Am. Clin. Prod. Rev. March, pp. 10-17,
1986.
13. COMPress, PO Box 102, Wentworth, NH 03282, USA. Tel: +1 (603) 764 5831 or +1 (800) 221 0419.
14. Figueras, J. An electronic notebook. ACS Symposium Series No. 341, ACS, Washington DC 20036,
USA, pp. 37-47, 1987.
15. ISI Software, ISI, 3501 Market Street, Philadelphia, PA 19104, USA. Tel: +1 (215) 386 0100 or +1 (800)
523 1850.
16. Gould, S. R. and Meyer, D. E. A chemical management systems for microcomputers. Am. Lab. Vol. 19,
No. 3, pp. 126-127, 1987.
17. Meyer, D. E. Software for accessing chemical information. Am. Lab. Vol. 19, No. 6, pp. 124-125, 1987.
18. Nagy, Z. M., Veszpremi, T., Csonka, G. and Bruck, P. Substructure search on a hierarchic tree of
chemical graphs. In: Heller, S. R. and Potenzone, R., eds. Computer applications in Chemistry: Proceedings
of the 6th International Conference on Computers in Chemical Research and Education. Elsevier Science
Publishers, Amsterdam. pp. 335-336, 1983.
19. HTSS (also known as TREE), Technical Data Service (TDS) Inc., Suite 2300, 10 Columbus Circle, New
York, NY 10019, USA. Tel: +1 (212) 245 0044 and ORAC Ltd., ULIS, 175 Woodhouse Lane, Leeds LS2
3AR, UK. Tel: +44 (532) 441821.
20. Markush formulas are generic chemical structures, typically characterized by having variable
nature substituent groups and variable substitution patterns on chains of atoms and rings. For example, see:
Gillet, V. J., Welford, S. M. and Downs, G. M. Computer storage and retrieval of generic chemical structures
and patents. 7. Parallel simulation of a relation algorithm for chemical substructure search. J. Chem. Inf.
Comput. Sci. Vol. 26, pp. 126-190, 1986 and references cited therein.
21. Hampden Data Services Ltd., 167 Oxford Road,
Cowley, Oxford, OX4 3ES, UK. Tel: +44 (865)
747250.
22. STN Express, STN International, PO Box
02228, Columbus, OH 43202, USA. Tel: +1 (800) 848
6538 or +1 (614) 421 3600, or STN International,
Postfach 2465, D-7500 Karlsruhe 1, FRG.
Tel: +49 (724) 824566 or STN International,
JICST, 5-2 Nagatacho 2-chome, Chiyoda-ku, Tokyo
100, Japan. Tel: +81 (258) 46 6507.