Chemical substructure search software for personal computers

Stephen R. Heller and Daniel E. Meyer

This is the third in a series of articles aiming to promote a higher awareness of the computer applications in the management, dissemination and uses of chemical data. It describes a number of computer programs which perform chemical substructure searching on personal computers.

Introduction

Part of the educational activities of the IUPAC Committee on Chemical Databases (CCDB) is to inform IUPAC members of interesting and significant computer activities. This relates to the CCDB term of reference to promote a higher awareness of the computer applications in the management, dissemination and uses of chemical data.

The first article described online computer services (1a). The second was concerned with the Beilstein and Gmelin databases (1b). This third article describes a number of computer programs which perform chemical substructure searching on personal computers (PCs).

Chemical substructure searching (SSS) is the ability to search a structure database using any structure fragment. (Using the entire or full structure is usually called an identity search or exact-match search.) The results of an SSS are the chemical structures in the database which contain the query structure fragment. An example of such a search, using HTSS software (see below), is shown in Figs 1-5.
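The distinction between a substructure search and an identity search can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by any of the packages reviewed here: the graph representation, the function names, and the tiny sample database are all assumptions made for the example, hydrogens are left implicit, and bond orders are ignored.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: (atoms, bonds), where atoms maps an atom index
# to its element symbol and bonds lists pairs of bonded atom indices.
Molecule = Tuple[Dict[int, str], List[Tuple[int, int]]]


def neighbours(mol: Molecule) -> Dict[int, Set[int]]:
    """Build an adjacency map from the bond list."""
    atoms, bonds = mol
    adj: Dict[int, Set[int]] = {i: set() for i in atoms}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def contains_fragment(mol: Molecule, query: Molecule) -> bool:
    """True if every query atom and bond can be mapped onto the molecule."""
    mol_atoms, q_atoms = mol[0], query[0]
    mol_adj, q_adj = neighbours(mol), neighbours(query)
    order = list(q_atoms)

    def extend(mapping: Dict[int, int]) -> bool:
        if len(mapping) == len(order):
            return True
        q = order[len(mapping)]
        for m in mol_atoms:
            if m in mapping.values() or mol_atoms[m] != q_atoms[q]:
                continue
            # Each already-mapped neighbour of q must be bonded to m.
            if all(mapping[n] in mol_adj[m] for n in q_adj[q] if n in mapping):
                mapping[q] = m
                if extend(mapping):
                    return True
                del mapping[q]  # backtrack and try the next candidate atom
        return False

    return extend({})


def is_identity(mol: Molecule, query: Molecule) -> bool:
    """Exact match: same atom and bond counts, and the query maps onto mol."""
    same_size = len(mol[0]) == len(query[0]) and len(mol[1]) == len(query[1])
    return same_size and contains_fragment(mol, query)


def substructure_search(database: Dict[str, Molecule], query: Molecule) -> List[str]:
    """Return the names of all database structures containing the fragment."""
    return [name for name, mol in database.items() if contains_fragment(mol, query)]


# Sample data (heavy atoms only); chosen purely for illustration.
DATABASE: Dict[str, Molecule] = {
    "ethanol": ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    "acetic acid": ({0: "C", 1: "C", 2: "O", 3: "O"}, [(0, 1), (1, 2), (1, 3)]),
    "propane": ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2)]),
    "dimethyl ether": ({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)]),
}
C_O: Molecule = ({0: "C", 1: "O"}, [(0, 1)])
C_C_O: Molecule = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])

print(substructure_search(DATABASE, C_O))
print(substructure_search(DATABASE, C_C_O))
```

With the C-C-O fragment as the query, the substructure search retrieves both ethanol and acetic acid, whereas an identity search with the same query matches ethanol only.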

Over the years there have been many ways to represent chemicals, from nomenclature to chemical notations to structure diagrams (which are actually mathematical graphs). There are problems with some forms of representation, such as nomenclature, as noted in the first article (1a). Linear notations, such as the Wiswesser Line Notation (WLN) (2) as well as other notations such as SMILES (3), all require learning a new 'language'.

With more powerful computers and inexpensive disk space becoming available in the past few years, graphic input, storage, and searching of structures using PCs have become fairly commonplace. The great advantage of drawing a structure or fragment is that the structure diagram is the universal chemical language and is thus readily understood by all chemists. In this article we discuss PC systems for searching which use structure diagrams for input (query) and output (answers).

Today, it is possible to search the Chemical Abstracts structure file of over nine million compounds online using the STN or DARC systems (4).

These systems are available via telephone access from anywhere in the world for a cost of about USD 250 per hour.

However, many chemists do not need regular access to such a large database. In most cases a database of the chemicals in one's own laboratory or those under current study, or the chemicals from a commercial catalogue, would satisfy day-to-day needs. Having one's own computerized file card system of chemicals is a goal which can easily be attained.

With the advent of the personal computer, a number of companies have developed, and are making available for a nominal cost, software which allows individual researchers to create, maintain, and add to a personal file of chemical structures (5,6). It is also possible to buy some sample databases and use these in teaching courses in universities and industry.

Hardware capabilities and costs

At present all SSS software for PCs runs on IBM-type machines under the MS-DOS operating system. No software for the Apple Macintosh is yet available, although the producers of two such software packages, HTSS and ChemData (7), have indicated that such versions will be available in 1990.

The specific hardware requirements vary from program to program, with some programs having such limited requirements that almost all PCs are able to run them without the need to buy additional equipment. To use some other programs to their full capability, in particular ChemBase, requires computer hardware that can easily cost about USD 6000-7000, as one needs a high resolution colour system with a fully equipped laser printer system. Put another way, 'ChemText's hardware appetite is quite hefty' (8). However, it should be noted that ChemText can run on hardware costing as little as a total of USD 2000.

To run most of the software described in this article the user needs an IBM PC or equivalent with 512K of main memory, a graphics capability (usually a Hercules graphics board, a Colour Graphics Adapter (CGA) or Enhanced Graphics Adapter (EGA) or equivalent), a mouse for structure input, and preferably a hard disk. Such a system should easily be available for about USD 1500 in the USA, and for slightly more in other countries. At this price, such equipment should be well within the means of individual chemists.

Available PC SSS software package capabilities

This article describes five SSS PC programs. They are, in alphabetical order:

ChemBase

ChemFile II

ChemSmart

HTSS (TREE)

PSIDOM

One of these packages (HTSS) also runs on other, larger computers, such as the VAX or IBM mainframe. As these computers are outside the scope of the computer systems available in laboratories or offices, these aspects and features will not be discussed.

Structure search program descriptions

These five packages all allow the chemist, for the first time, to have this powerful tool available in the laboratory or office. For university chemists, either for their own research activities or for use in teaching courses, the ChemFile II or ChemSmart programs are useful for learning SSS techniques and methods.

For those with more advanced needs, the HTSS and PSIDOM systems are likely to be the next step up. The best supported, but also the most expensive, of the five packages is ChemBase. In any case, what is of most importance is for chemists to begin to make use of this type of computer software, as it is likely to gain acceptance throughout the chemical community.

CHEMBASE

ChemBase is the chemical substructure search software from Pergamon-Molecular Design Ltd (MDL) (8-12). MDL has sold more in-house systems than any other company in the field, and lately has expanded into the PC market.

ChemBase is a well-polished software package, with excellent documentation and a very good user interface. ChemBase handles data and text information as well as structures. ChemBase also interfaces directly to the MDL text processor, ChemText. It requires an IBM PC, with 640K of RAM, a hard disk, a mouse, and preferably an EGA card (although a CGA or Hercules graphics adapter will work), and a colour monitor (although it will run quite adequately on a monochrome monitor).

A fully equipped laser printer brings out all of the superior features of the program, but dot matrix output is acceptable. Its search speed and the size of files it can handle are less than for some of the other programs, such as HTSS and PSIDOM. ChemBase also allows users to design their own output format.

CHEMFILE II

ChemFile II is the chemical substructure search program from COMPress (12,13), written by John Figueras, a retired chemist from Eastman Kodak. ChemFile II is a very affordable floppy disk-based package, designed for very small files and for people on a budget, who have a PC with 256K of memory and 5 1/4 inch floppy disks.

If you have more than one disk of data to search, the program allows you to continue the search or use another disk of data. Each diskette can hold data for about 250 compounds, which makes a complete search of a large database (of several thousand compounds) slow and requires the user to sit at the terminal and change disks until the search is completed.
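The cost of a floppy-based search is easy to estimate. The short sketch below assumes the capacity of about 250 compounds per diskette quoted above; the function names and the example file size of 3000 compounds are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

COMPOUNDS_PER_DISK = 250  # approximate capacity per diskette quoted in the text


def disks_needed(n_compounds: int, per_disk: int = COMPOUNDS_PER_DISK) -> int:
    """Number of diskettes required to hold n_compounds."""
    return math.ceil(n_compounds / per_disk)


def disk_swaps(n_compounds: int, per_disk: int = COMPOUNDS_PER_DISK) -> int:
    """Manual disk changes during one complete search (first disk is already in)."""
    return max(disks_needed(n_compounds, per_disk) - 1, 0)


# Hypothetical file of 3000 compounds: 12 diskettes, 11 manual changes.
print(disks_needed(3000), disk_swaps(3000))
```

A file of a few thousand compounds therefore means roughly a dozen diskettes and a disk change for each one during every complete search.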

Text and numerical information as well as structures can be entered. ChemFile II supports up to 20 user-defined data fields, but the size of each field is limited. A good feature of this program is the ability to search for ranges of numerical data, such as boiling point ranges. The documentation and user interface are both good.
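A numerical range search of this kind can be sketched as follows. The record layout, the function names, and the sample data are assumptions made for illustration and do not describe ChemFile II's actual file format; only the limit of 20 user-defined fields is taken from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

MAX_FIELDS = 20  # limit on user-defined data fields quoted in the text

FieldValue = Union[float, str]


@dataclass
class CompoundRecord:
    """Hypothetical record: a compound name plus user-defined data fields."""
    name: str
    fields: Dict[str, FieldValue] = field(default_factory=dict)

    def set_field(self, key: str, value: FieldValue) -> None:
        # Refuse a new field once the limit has been reached.
        if key not in self.fields and len(self.fields) >= MAX_FIELDS:
            raise ValueError(f"at most {MAX_FIELDS} data fields per record")
        self.fields[key] = value


def range_search(records: List[CompoundRecord], key: str,
                 low: float, high: float) -> List[str]:
    """Names of records whose numeric field lies within [low, high]."""
    hits = []
    for rec in records:
        value = rec.fields.get(key)
        if isinstance(value, (int, float)) and low <= value <= high:
            hits.append(rec.name)
    return hits


# Sample data: boiling points in degrees Celsius at atmospheric pressure.
records = [
    CompoundRecord("ethanol", {"bp_C": 78.4}),
    CompoundRecord("acetone", {"bp_C": 56.1}),
    CompoundRecord("water", {"bp_C": 100.0}),
    CompoundRecord("toluene", {"bp_C": 110.6, "supplier": "in-house"}),
]

print(range_search(records, "bp_C", 50.0, 80.0))
```

Records that lack the field, or that hold text in it, are simply skipped rather than treated as errors.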

CHEMSMART

ChemSmart is the chemical substructure search program being marketed by ISI (14-16), and written by Scott Gould (11-13). ChemSmart is another affordable package for people on a budget with just a PC and 5 1/4 inch floppy disks.

The program supports some numerical and text information searching as well as structure searching. However, while a good deal of data can be entered, stored, and displayed, only the molecular formula and the compound names are searchable.

ChemSmart comes with a sample database of 250 compounds, and additional specialized databases from the main ISI Index Chemicus database are available for purchase. These databases all have Index Chemicus numbers, but no CAS Registry Numbers. For teaching purposes, this product would seem to be very useful. The documentation and user interface are quite good, but structure entry is slow.

HTSS

HTSS is the chemical substructure search program from Hierarchic Tree Substructure Search Systems (17,18) and was developed in Hungary by Peter Bruck and colleagues. It is to be used by CCDB for IUPAC computerized structure search products (see Box).

The official release of the PC version took place early in 1988, so it is too soon to comment on updates and improvements. Versions of the HTSS software are also available which run on the VAX family of computers and IBM mainframe computers.

HTSS is marketed in the USA under the names HTSS and TREE. It does not support data and text, but rather interfaces with PC database management programs, such as dBASE III Plus. The program allows for Markush structure searching, so long as you have such a database of Markush structures (19). It also interfaces with virtually all word processors by allowing the user to add a structure at any place within a word-processing program, such as WordStar or WordPerfect.

HTSS is by far the fastest of all the chemical structure software packages available, and can easily handle large files of 25 000 structures or more. The documentation is its weakest point, but is improving. The tutorial demonstration disk which comes with the system and the user interface are excellent.

PSIDOM

PSIDOM (Professional Structure Image Database On Microcomputer) is part of a family of PC software for chemists from Hampden Data Services (20). PSIGEN is the basic software which allows an IBM PC computer to create connection tables for chemical representation and then display these chemicals as structures on the screen.

PsiBase (which includes the PSIGEN drawing capabilities) is the structure search module which takes the PSIGEN connection tables and actually performs the substructure searching. PsiBase also allows the user to store and search data together with the chemical structures.
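For readers unfamiliar with connection tables, the sketch below shows one simple form: a list of atoms and a list of bonds between them. It is not the PSIGEN file format, which is not described here; the class and method names are hypothetical, hydrogens are taken to be implicit, and the hydrogen count assumes neutral atoms with the usual valences.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

# Assumed standard valences for neutral atoms (illustrative subset only).
VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


@dataclass
class ConnectionTable:
    """Hypothetical minimal connection table with implicit hydrogens."""
    atoms: List[str]                    # element symbol of each heavy atom
    bonds: List[Tuple[int, int, int]]   # (atom index, atom index, bond order)

    def bond_order_sum(self, i: int) -> int:
        """Total order of all bonds that involve atom i."""
        return sum(order for a, b, order in self.bonds if i in (a, b))

    def implicit_hydrogens(self) -> int:
        """Hydrogens needed to satisfy the assumed valence of every atom."""
        return sum(max(VALENCE[sym] - self.bond_order_sum(i), 0)
                   for i, sym in enumerate(self.atoms))

    def molecular_formula(self) -> str:
        """Formula with carbon first, then hydrogen, then the rest alphabetically."""
        counts = Counter(self.atoms)
        counts["H"] += self.implicit_hydrogens()
        symbols = [s for s in ("C", "H") if counts[s]]
        symbols += sorted(s for s in counts if s not in ("C", "H") and counts[s])
        return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in symbols)


# Acetic acid: CH3-C(=O)-OH, heavy atoms numbered 0 to 3.
acetic_acid = ConnectionTable(
    atoms=["C", "C", "O", "O"],
    bonds=[(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)
ethanol = ConnectionTable(atoms=["C", "C", "O"], bonds=[(0, 1, 1), (1, 2, 1)])

print(acetic_acid.molecular_formula())
print(ethanol.molecular_formula())
```

Because the table records which atoms are bonded to which, both a displayed structure diagram and derived data such as the molecular formula can be generated from the same stored record.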

At present the entire software system is being bundled together with the Derwent Standard Drug File which contains over 16 000 chemical structures. The system has mouse-based structure input as well as keyboard input. The documentation is good and the program is easy to learn. The ability to interface with and easily use other modules developed by the company, such as PsiCard, PsiView, PsiPlot, and PsiText, is a positive feature of this system.

The Hampden Data Services software is also the basis of the recently released STN Express software (21,22) which allows offline structure query formulation before going online to search the CAS Online structure database on STN International. PsiBase and STN Express are totally compatible and queries may be exchanged between the two systems.

Summary

This article has described a number of commercially available computer programs which allow chemists from all IUPAC countries throughout the world to create structure databases on a PC and to search them by chemical structure fragment.

The ability to perform substructure searching is a valuable new tool for chemists in universities, industry and government. Now that such software has become relatively inexpensive and easy to install and use on personal computers, which chemists already own or have access to, this technique should become more popular throughout the chemical community.

References

1. (a) Heller, S. R. Chem. Int. Vol. 9, pp. 136-138, 1987. (b) Heller, S. R. Chem. Int. Vol. 11, pp. 49-52, 1989.

2. Smith, E. G. and Baker, P. A. The Wiswesser line-formula chemical notation (WLN), 3rd Edn. Chemical Information Management, Cherry Hill, NJ, USA. 1976.

3. MedChem Software Manual, Release 3.32, Medicinal Chemistry Project, Seaver Chemistry Lab., Pomona College, Claremont, CA 91711, USA.

4. Directory of Online Databases. Cuadra/Elsevier, Vol. 9, January, see pp. 106 and 323, 1988.

5. Meyer, D. E. Microcomputer-based software for chemical structure management: a comparison. ACS Symposium Series No. 341, ACS, Washington DC 20036, USA, pp. 29-36, 1987.

6. Warr, D. Introduction to graphics for chemical structures. ACS Symposium Series No. 341, 1987.

7. ChemData will be available from VCH Publishers, 220 East 23rd Street, Suite 909, New York, NY 10010, USA. Tel: +1 (212) 683 8333.

8. Butler, L. C. Scientific Word Processor Integrates Tricky Symbols. The Scientist, July 11, p. 22, 1988.

9. Molecular Design Ltd., 2132 Farallon Drive, San Leandro, CA 94577, USA. Tel: +1 (415) 895 1313.

10. Meyer, D. and Cohan, P. Designing new compounds with a PC database. Am. Biotech. Lab. Vol. 5, No. 1, pp. 32-39, 1987.

11. Seiter, C. Your PC may solve your chem lab problems. R&D, Vol. 29, No. 3, pp. 94-96, 1987.

12. Curry-Koenig, B. PC Chemical databases--new tools for chemists. Am. Clin. Prod. Rev. March, pp. 10-17, 1986.

13. COMPress, PO Box 102, Wentworth, NH 03282, USA. Tel: +1 (603) 764 5831 or +1 (800) 221 0419.

14. Figueras, J. An electronic notebook. ACS Symposium Series No. 341, ACS, Washington DC 20036, USA, pp. 37-47, 1987.

15. ISI Software, ISI, 3501 Market Street, Philadelphia, PA 19104, USA. Tel: +1 (215) 386 0100 or +1 (800) 523 1850.

16. Gould, S. R. and Meyer, D. E. A chemical management system for microcomputers. Am. Lab. Vol. 19, No. 3, pp. 126-127, 1987.

17. Meyer, D. E. Software for accessing chemical information. Am. Lab. Vol. 19, No. 6, pp. 124-125, 1987.

18. Nagy, Z. M., Veszpremi, T., Csonka, G. and Bruck, P. Substructure search on a hierarchic tree of chemical graphs. In: Heller, S. R. and Potenzone, R., eds. Computer applications in Chemistry: Proceedings of the 6th International Conference on Computers in Chemical Research and Education. Elsevier Science Publishers, Amsterdam. pp. 335-336, 1983.

19. HTSS (also known as TREE), Technical Data Service (TDS) Inc., Suite 2300, 10 Columbus Circle, New York, NY 10019, USA. Tel: +1 (212) 245 0044 and ORAC Ltd., ULIS, 175 Woodhouse Lane, Leeds LS2 3AR, UK. Tel: +44 (532) 441821.

20. Markush formulas are generic chemical structures, typically characterized by having substituent groups of variable nature and variable substitution patterns on chains of atoms and rings. For example, see: Gillet, V. J., Welford, S. M. and Downs, G. M. Computer storage and retrieval of generic chemical structures in patents. 7. Parallel simulation of a relaxation algorithm for chemical substructure search. J. Chem. Inf. Comput. Sci. Vol. 26, pp. 126-190, 1986 and references cited therein.

21. Hampden Data Services Ltd., 167 Oxford Road, Cowley, Oxford, OX4 3ES, UK. Tel: +44 (865) 747250.

22. STN Express, STN International, PO Box 02228, Columbus, OH 43202, USA. Tel: +1 (800) 848 6538 or +1 (614) 421 3600, or STN International, Postfach 2465, D-7500 Karlsruhe 1, FRG. Tel: +49 (724) 824566 or STN International, JICST, 5-2 Nagatacho 2-chome, Chiyoda-ku, Tokyo 100, Japan. Tel: +81 (258) 46 6507.