CHEMICAL SUBSTRUCTURE SEARCHING ON A PC

Stephen R. Heller

Abstract: In the past two years a number of computer programs have been developed which allow one to perform chemical substructure searching on a PC. This overview of the field describes the features of five major systems: ChemBase, ChemSmart, chemFile, HTSS, and PSIGen/PSIBase.

1 INTRODUCTION

Over the past decade there has been a considerable increase in the use of chemical substructure searching, primarily on large files of either in-house databases or the CAS database. These files generally range from the tens of thousands to over seven million entries. Because of the high cost of storing and searching these databases, they were all centralized, either on public or private large time sharing systems.

Many organizations have large in-house databases of the chemicals the company has synthesized and studied in the past. For such large databases, often reaching over 100,000 compounds, a program to search the chemicals is valuable for scientific and administrative purposes. In addition to a company-wide database, many chemists keep their own card file of chemicals they are working on, or are interested in. For many reasons, it is not practical to put such small databases on a central company computer. With the advent of hard disks with 20-40 MB of storage for PC's, it became feasible to consider developing software to handle such individual needs. While security is more of a problem when many scientists have their own databases, it seems this is a price many companies are willing to pay for the increased productivity of their scientists.

Within the past two years the microcomputer revolution has caught up with chemical substructure searching. The result has been a flurry of computer programs designed for the individual chemist to create (or use) a database in the lab or office without accessing the main computer. This paper will examine the features of five of these programs. A sixth program, TOPDOG (Ref. 1), will not be reviewed for two main reasons. The first is that it uses the SMILES notation, rather than chemical structures. This is considered to be unacceptable to the majority of chemists and information specialists because the notation is unwieldy. The second is that the price of $15,000 is out of the range of most people buying PC software. In addition, the slow sequential searching, often 10-15 seconds per compound, is unacceptable for any large database. The five programs reviewed are (in alphabetical order):

1. ChemBase
2. chemFile
3. ChemSmart
4. HTSS
5. PSIGen/PSIBase

In addition to these programs which perform chemical structure searching (really the matching of mathematical graph patterns), many programs have been developed which allow one to draw a chemical structure, but only for presentation, not for searching. Some of these are word processing programs, while others are simply specialty programs which allow for high quality output for reports and similar purposes. None of these programs are discussed here. For readers interested in examples of such programs, there is the recent excellent survey by Warr (Ref. 3).
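The remark that structure searching is really the matching of graph patterns can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from any of the five programs: atoms are nodes labeled with an element symbol, bonds are unlabeled edges, and hydrogens and bond orders are ignored. The names make_graph and find_substructure are hypothetical, and the plain backtracking shown here is only one possible matching technique.

```python
from typing import Dict, List, Optional, Set, Tuple

# Simplified molecular graph (an assumption of this sketch):
# labels: atom index -> element symbol; adj: atom index -> bonded atom indices.
Graph = Tuple[Dict[int, str], Dict[int, Set[int]]]


def make_graph(atoms: List[str], edges: List[Tuple[int, int]]) -> Graph:
    """Build a graph from a list of element symbols and a list of bonds."""
    labels = dict(enumerate(atoms))
    adj: Dict[int, Set[int]] = {i: set() for i in labels}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return labels, adj


def find_substructure(query: Graph, target: Graph) -> Optional[Dict[int, int]]:
    """Return one mapping of query atoms onto target atoms, or None."""
    q_atoms, q_adj = query
    t_atoms, t_adj = target
    order = list(q_atoms)

    def extend(mapping: Dict[int, int]) -> Optional[Dict[int, int]]:
        if len(mapping) == len(order):
            return dict(mapping)  # every query atom has been placed
        q = order[len(mapping)]
        for t in t_atoms:
            # Skip target atoms already used or of the wrong element.
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Each already-placed neighbor of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]  # backtrack and try the next candidate
        return None

    return extend({})


# Query: a carbon bonded to two oxygens (the skeleton of a carboxyl group).
carboxyl = make_graph(["C", "O", "O"], [(0, 1), (0, 2)])
# Targets: heavy-atom skeletons of acetic acid and ethanol.
acetic_acid = make_graph(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
ethanol = make_graph(["C", "C", "O"], [(0, 1), (1, 2)])

print(find_substructure(carboxyl, acetic_acid))
print(find_substructure(carboxyl, ethanol))
```

Searching a file sequentially amounts to running such a test against every stored structure in turn, which is one reason search time becomes the deciding factor for larger files.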

This paper will compare the five named programs and provide a critical overview. Later presentations in this symposium will describe and discuss three of these systems in detail. As most of the programs are being updated and upgraded it is likely that some of the information and charts will be out of date by the time this article is read. It is advisable to check with each vendor to learn the latest status and capabilities of their software. When possible the appropriate literature references are provided to allow one to read further details of a particular program.

These five packages have varying capabilities, costs, and computer demands. Meyer (Ref. 2) has provided some comparison of four of these systems, and the discussion here is based on his work. The additional information used in the comparisons in this paper comes from the author's experience in using some of these programs, as well as comments from other users and responses to questions put to the vendors, who were all most cooperative. The main question one needs to ask of all of these software programs is "Why buy it?".

That is, what purpose do these programs serve, and how can they be used in your environment?

Everyone has their own view on what features are important and should be compared. Clearly the choice of features can radically alter the impression one gets of a particular software package. The comparisons in this paper are intended to let readers pick out the features which relate to their own requirements.

With most IBM PC computer programs today using 512K memory and having a hard disk, information on hardware requirements is important, but not critical in choosing what program to buy. However, you must be sure to examine the hardware requirements for the software you plan to buy, or you will have problems or be disappointed. For example, the excellent high resolution color graphics of the ChemBase program, which is the best of any of these software programs, requires an expensive color monitor and color graphics board, and a laser printer to take full advantage of what the software has to offer. These items, if you don't already own them, probably exceed the cost of the ChemBase software. Today chemists have accepted using a mouse as the way to enter chemical structures. At a cost of only $100 or so, this is not a factor to be concerned about. With this caveat, the following comparisons hopefully will provide potential users with some assistance in making their choice of chemical structure software.

2 CHEMBASE

ChemBase is the chemical substructure search software from Molecular Design Ltd. (MDL) (Refs. 4-7). MDL has sold more in-house systems than any other company in the field, and lately has expanded into the PC market. ChemBase is a well polished software package, with excellent documentation and a very good user interface. ChemBase handles data and text information as well as structures. ChemBase interfaces directly to the MDL word processor, ChemText. It requires an IBM PC, with 640K of RAM, a mouse, and preferably an EGA card and a color monitor. Its speed and the size of files it can handle are less than for some of the other programs.

3 CHEMFILE

ChemFile is the chemical substructure search program from COMPress (Ref. 8), written by John Figueras, a retired chemist from Eastman Kodak (Ref. 9). ChemFile is designed for very small files and for people on a budget, who have just a PC with ordinary 5 1/4 inch floppy disks. If you have more than one disk of data to search, the program allows you to continue the search with another disk of data. Of course you need about one disk for every 250 compounds, which makes a complete search of a large database slow and requires the user to sit at the terminal and change disks until the search is completed. Data and text information as well as structures can be entered into chemFile.

ChemFile supports up to 20 user-defined data fields, but the size of each field is limited. A good feature of this program is the ability to search for ranges of numeric data, such as boiling point ranges. The documentation and user interface are both good.
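A range search over a numeric data field, of the kind described above for boiling points, can be sketched as follows. This is an illustration only: the record layout, the field name bp_celsius, and the function range_search are hypothetical and do not reflect chemFile's actual file format. The boiling points are rounded values included purely as sample data.

```python
from typing import Dict, List

# Each record holds a name plus user-defined data fields (hypothetical layout).
Record = Dict[str, object]

SAMPLE_FILE: List[Record] = [
    {"name": "acetone", "bp_celsius": 56.1},
    {"name": "ethanol", "bp_celsius": 78.4},
    {"name": "benzene", "bp_celsius": 80.1},
    {"name": "water", "bp_celsius": 100.0},
    {"name": "unknown sample"},  # record with no boiling point entered
]


def range_search(records: List[Record], field: str,
                 low: float, high: float) -> List[Record]:
    """Return the records whose numeric field lies in [low, high]."""
    hits = []
    for rec in records:
        value = rec.get(field)
        # Records lacking the field, or holding text in it, never match.
        if isinstance(value, (int, float)) and low <= value <= high:
            hits.append(rec)
    return hits


hits = range_search(SAMPLE_FILE, "bp_celsius", 75.0, 85.0)
print([rec["name"] for rec in hits])
```

The scan is sequential, so for a file spread over several floppy disks the same loop would simply be repeated for each disk in turn.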

4 CHEMSMART

ChemSmart is the chemical substructure search program originally marketed by Academic Press, and now being marketed by ISI (Ref. 10), since Academic Press withdrew from the software market. ChemSmart was written by Scott Gould (Refs. 10-12). ChemSmart is also for people on a budget with just a PC and 5 1/4 inch floppy disks. ChemSmart supports data and text information as well as structures. ChemSmart comes with a sample database of 250 compounds, and additional specialized databases from the main ISI database are available for purchase. These databases all have Current Abstracts of Chemistry and Index Chemicus numbers, but no CAS Registry number. For teaching purposes it would seem this product is very useful. The documentation and user interface are quite good, but structure entry is slow.

5 HTSS

HTSS is the chemical substructure search program from Hierarchic Tree Substructure Search Systems (Refs. 13-14). It was developed in Hungary by Peter Bruck and colleagues and is now being jointly developed with the Beilstein Institute. The official release of the PC version was only this past summer, so it is too soon to comment on updates and improvements. It is the only one of the five which runs on the IBM PC (and clones) as well as the VAX family of computers and the large main-frame IBM computers. HTSS does not support data and text, but rather interfaces with PC database management programs, such as dBASE III. It also interfaces with virtually all word processors by allowing one to add in a structure at any place within a word processing program, such as Wordstar or WordPerfect. HTSS is by far the fastest of all the chemical structure software packages available, and can easily handle large files of 25,000 structures or more. The documentation is its weakest point, but is improving. The tutorial demonstration disk which comes with the system and the user interface are both excellent.

6 PSIGen/PSIBase

PSIGen/PSIBase (PSI means Professional Structure Image and is part of the PSIDOM (Database On Microcomputer) series of software) is the chemical substructure search program from Bill Town of Hampden Data Services (Ref. 15). PSIGen is the software which allows an IBM PC computer to create connection tables for chemical representation and then display these chemicals as structures on the screen. PSIBase is the structure search module which takes the PSIGen connection tables and actually performs the searching. Taken together, they are designed to be part of the PSIDOM series of PC programs for chemical word processing as well as structure display and searching. The system has both mouse-based and keyboard structure input. The documentation is good and the program is easy to learn. The ability to interface with and easily use other modules developed by the company, such as PSICard, PSIView, PSIPlot, and PSIText, is a positive feature of this system. At present, the software is copy protected and cannot be used on any computer with 3 1/2 inch disks.
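For readers unfamiliar with the term, a connection table is simply a list of atoms together with a list of the bonds joining them. The sketch below is a generic illustration and does not reproduce the PSIGen file format, which is not described here. The class ConnectionTable, its methods, and the small valence table are all hypothetical. The example also shows how a simple valence check and a molecular formula can be derived from such a table.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

# Usual valences for a few common elements (enough for this sketch).
NORMAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


@dataclass
class ConnectionTable:
    atoms: List[str]                   # element symbol of each non-hydrogen atom
    bonds: List[Tuple[int, int, int]]  # (atom index, atom index, bond order)

    def bond_order_sum(self, i: int) -> int:
        """Total bond order of all bonds drawn to atom i."""
        return sum(order for a, b, order in self.bonds if i in (a, b))

    def implicit_hydrogens(self, i: int) -> int:
        """Hydrogens needed to fill atom i; raises if the valence is exceeded."""
        free = NORMAL_VALENCE[self.atoms[i]] - self.bond_order_sum(i)
        if free < 0:
            raise ValueError(f"valence exceeded on atom {i} ({self.atoms[i]})")
        return free

    def formula(self) -> str:
        """Molecular formula: C first, then H, then other elements A-Z."""
        counts = Counter(self.atoms)
        counts["H"] += sum(self.implicit_hydrogens(i)
                           for i in range(len(self.atoms)))
        ordered = [e for e in ("C", "H") if counts[e]] + sorted(
            e for e in counts if e not in ("C", "H") and counts[e])
        return "".join(e + (str(counts[e]) if counts[e] > 1 else "")
                       for e in ordered)


# Ethanol, CH3-CH2-OH, entered as three atoms and two single bonds.
ethanol = ConnectionTable(atoms=["C", "C", "O"], bonds=[(0, 1, 1), (1, 2, 1)])
print(ethanol.formula())
```

Because the table records only which atoms are joined, the same data can drive both the screen display and the search module, which is the division of labor described above for PSIGen and PSIBase.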

7 COMPARISON OF FEATURES

Structure Input/Manipulation Capabilities:

                 Templates  Enlarge/Shrink  Rotate  Clean  Valence Check

ChemBase             Y            Y           Y       Y          Y
chemFile             Y            Y           N       N          N
ChemSmart            Y            Y           Y*      N          N
HTSS                 Y            Y           Y       N          Y
PSIGen/PSIBase       Y            Y           Y       N          Y

* Rotation is by 90 degrees only

Most of the programs have stereochemistry capability, but since 99.9% or more of searches are two dimensional, this capability seems not to be too useful or important. PSIGen/PSIBase also has the Feldmann (Ref. 16) input notation commands in addition to accepting input from a mouse.

Structure Search Capabilities:

                 Exact      Sub        Max Size     File
                 Structure  Structure  (Practical)  Structure

ChemBase             Y          Y          4000     Sequential
chemFile             Y          Y          1000     Sequential
ChemSmart            Y          Y          2000     Sequential
HTSS                 Y          Y         25000     Inverted Tree
PSIGen/PSIBase       Y          Y          6000     Primarily Sequential

Search times vary considerably. To search a database of 1000 compounds, HTSS takes from 5-15 seconds, while the sequential search programs take a few minutes (2-10). Only HTSS has a Markush generic search capability, which is available as an option. However, unless you create your own Markush (Ref. 17) formula database this capability is of no value.
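The search times quoted above for a 1000-compound database can be turned into rough per-compound figures with a little arithmetic. The sketch below does this, and also estimates how long a sequential search would take on a 4000-compound file (the practical maximum listed for ChemBase). That estimate assumes sequential search time grows in direct proportion to file size, which is an assumption of this sketch rather than a measured result. No such scaling is attempted for HTSS, since its tree-based file structure need not scale linearly. The function names are hypothetical.

```python
# Figures quoted in the text for a 1000-compound database.
REFERENCE_SIZE = 1000
HTSS_SECONDS = (5, 15)                  # 5-15 seconds
SEQUENTIAL_SECONDS = (2 * 60, 10 * 60)  # 2-10 minutes, in seconds


def per_compound(total_seconds: float, n_compounds: int = REFERENCE_SIZE) -> float:
    """Average search time per compound, in seconds."""
    return total_seconds / n_compounds


def scale_sequential(total_seconds: float, new_size: int) -> float:
    """Estimated sequential search time, in seconds, for another file size.

    Assumes time is directly proportional to the number of compounds.
    """
    return total_seconds * new_size / REFERENCE_SIZE


htss_each = tuple(per_compound(s) for s in HTSS_SECONDS)
seq_each = tuple(per_compound(s) for s in SEQUENTIAL_SECONDS)
seq_4000_minutes = tuple(scale_sequential(s, 4000) / 60 for s in SEQUENTIAL_SECONDS)

print("HTSS, seconds per compound:      ", htss_each)
print("Sequential, seconds per compound:", seq_each)
print("Sequential, minutes for 4000:    ", seq_4000_minutes)
```

Under the linear assumption, a sequential search of a 4000-compound file would take somewhere between 8 and 40 minutes, which gives a sense of why the practical file sizes in the table are so much smaller for the sequential programs.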

Other Searching Capabilities:

                 Text         Interface to  Interface to
                              PC DBMS       Word Processor

ChemBase         Y            Y             Y
chemFile         Y            N             N
ChemSmart        Name & MF*   N             N
HTSS             N            Y             Y
PSIGen/PSIBase   Name Only    Y             Y
  (using PSICard)

* MF = Molecular Formula

Data Search Capabilities:

                 ID #  Name  MF   MW*  Other Data
                                       Fields

ChemBase          Y     Y    Y    Y        Y
chemFile          Y     Y    Y    Y        Y
ChemSmart         Y     Y    Y    N        Y
HTSS              Y     N    N    N        N
PSIGen/PSIBase    Y     Y    N    Y        Y

* MW = Molecular Weight

None of the programs has the ability to upload a structure into either of the two main chemical structure search systems, STN-CAS ONLINE and QUESTEL-DARC. Furthermore none of the programs can download structures from either of these two online systems into a PC in order to create a database for local searching.

Prices:

ChemBase $ 3500
chemFile $ 150
ChemSmart $ 335
HTSS $ 499
PSIGen/PSIBase $ 990

(PSICard, a program module which enables data and/or text to be stored and associated with a given chemical structure costs an additional $195).

Most of the programs come with sample files, which are useful in allowing one to learn how to use the software. However, the user needs to create his or her own database, since searching a sample file is good only for learning and demonstrations.

8 SUMMARY

Of the five programs, three (ChemBase, HTSS, and PSIGen/PSIBase) have sufficient capabilities to be useful in almost any lab. ChemBase has the advantage of being one of a number of software packages from a company devoted solely to computer software for chemists. Its documentation is first class. Its major disadvantage is the considerable cost of the program.

HTSS is the most powerful of the programs in terms of chemical structure searching, can easily handle a large number of compounds with very fast retrieval times, and has an equivalent version of the HTSS software running on VAX and IBM mainframe (3090) computers. It suffers from a lack of polished documentation.

PSIGen/PSIBase (along with PSICard for data file and retrieval) is a well designed package for medium sized files. Its interface with word processing, data management software, and other modules makes it an attractive total system.

The other two systems are probably best for teaching and learning activities.

9 ACKNOWLEDGEMENTS

The author would like to thank the vendors for their help in supplying information which was used in preparing this article, and Dan Meyer (ISI) for his insightful comments.

Stephen R. Heller
U.S. Department of Agriculture
Model and Database Coordination Laboratory
Building 007, Room 56, BARC-West
Beltsville, MD 20705-2350 USA
Phone: (301) 344-1709
Telemail: SRHELLER
Bitnet: SRHELLER @ ARS BARC

10 REFERENCES

1. Health Designs, Inc., 183 Main Street, Rochester, NY 14604 USA.

2. D. E. Meyer, "Microcomputer-Based Software for Chemical Structure Management: A Comparison", Chapter 4, pages 29-36, ACS Symposium Series #341, ACS, Washington DC 20036 USA, 1987.

3. W. Warr, "Introduction to Graphics for Chemical Structures", pages ix - xiii, ACS Symposium Series #341, ACS, Washington, DC 20036 USA, 1987.

4. Molecular Design Ltd., 2132 Farallon Drive, San Leandro, CA 94577 USA.

5. D. Meyer and Peter Cohan, "Designing New Compounds with a PC Database", Am. Biotech. Lab., 5(1), pages 32-39(1987).

6. C. Seiter, "Your PC May Solve Your Chem Lab Problems", R&D, 29(3), pages 94-96(1987).

7. B. Curry-Koenig, "PC Chemical Databases - New Tools for Chemists", Am. Clin. Prod. Rev., pages 10-17, March 1986.

8. COMPress, PO Box 102, Wentworth, NH 03282 USA.

9. J. Figueras, "An Electronic Notebook", Chapter 5, pages 37-47, ACS Symposium Series #341, ACS, Washington DC 20036 USA, 1987.

10. ISI Software, ISI, 3501 Market Street, Philadelphia, PA 19104 USA.

11. S. R. Gould and D. E. Meyer, "A Chemical Management Systems for Microcomputers", Amer. Lab., 19 (3), 126-127(1987).

12. D. E. Meyer, "Software for Accessing Chemical Information", Amer. Lab., 19(6), 124-125(1987).

13. Z. M. Nagy, T. Veszpremi, G. Csonka, and P. Bruck, "Substructure Search on a Hierarchic Tree of Chemical Graphs", in S. R. Heller and R. Potenzone (Eds.), Computer Applications in Chemistry, Proceedings of the 6th International Conference on Computers in Chemical Research and Education, pages 335-336, Elsevier Science Publishers, Amsterdam (1983).

14. HTSS (also known as TREE), Technical Data Service (TDS) Inc., Suite 2300, 10 Columbus Circle, New York, NY 10019 USA.

15. Hampden Data Services Ltd., Hampden Cottage, Abingdon Road, Clifton Hampden, Abingdon, Oxon, OX14 3EG, England.

16. R. E. Feldmann and S. R. Heller, "An Application of Interactive Graphics - The Nested Retrieval of Chemical Structures", J. Chem. Doc., 12, 48-54(1972).

17. Markush formulas are generic chemical structures, typically characterized by having variable nature substituent groups and variable substitution patterns on chains of atoms and rings. For example see V. J. Gillet, S. M. Welford, M. F. Lynch, P. Willett, J. M. Barnard, and G. M. Downs, "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 7. Parallel Simulation of a Relation Algorithm for Chemical Substructure Search", J. Chem. Inf. Comput. Sci., 26, pages 126-130(1986) and references cited therein.