Structure Searching Systems

S. Heller,
US Environmental Protection Agency, USA

Keywords: Chemical Structure Searching, Substructure Searching, CAS ONLINE, CROSSBOW, DARC, MACCS, NIH/EPA CIS, SANSS, Questel, ORGANIC, WLN, Wiswesser Line Notation

Abstract: The state of the art in chemical structure and substructure searching is reviewed, with a table of features highlighting five of the six commercially available systems discussed. This paper introduces the symposium presentations that follow, which discuss the systems in further detail from the point of view of the user.

1 INTRODUCTION

The considerable growth of, and interest in, chemical structure search systems has led to this ongoing symposium. Over the past year the community has seen the existing systems mature, with fewer additional features and capabilities being added. Chemical structure search systems have been developing for over two decades. The first systems, such as the GREMAS system in Germany and the now defunct CIDS system in the USA, were followed by systems developed at Merck, WRAIR, Upjohn and elsewhere. It was not until the Wiswesser Line Notation (WLN) based CROSSBOW system, developed at ICI, was offered to outside companies that the first commercial system became available to chemists. In the last six years connection table based systems have been developed and made publicly available, either as online systems or for in-house use.

The symposium organized at this meeting, like those held in the past two years, is intended to have real users describe their needs and how the system (or systems) they are using meets these needs. While a particular user, and even a particular organization, will have unique needs, it was felt that such a collection of presentations from users (as opposed to system vendors) would shed new light on this developing field. The maturation of this field over the past year, while considerable, no doubt leaves some user requirements unfilled, which hopefully will be reflected in the remarks of the speakers. In one way or another, all six systems described here continue to develop and evolve, although some more than others. (ORGANIC, a new WLN based structure search system from Japan, is described briefly here, but there will be no presentation on that system this year.) Two of the systems, DARC and the NIH/EPA CIS SANSS, are Government sponsored; three others, CROSSBOW, ORGANIC, and MACCS, are from private companies; and the last, CAS ONLINE, is the first venture by this non-profit organization to disseminate its own database.

Owing to the announced changes in these systems, the table given at the end of this paper is to be used only as a rough guide. Insufficient information was available in English on the ORGANIC system, so it is not included in the table. The table, without a considerable amount of accompanying text, may also lead to some misunderstanding and distortion of the systems described. This is always a problem when one tries to provide a convenient summary of a very complex subject. However, as each user's application differs slightly, the best solution will be for a person to define their needs, prepare typical expected queries that reflect such needs, and test each system against them. In an attempt to highlight the systems, a short description of each is provided below. Again, the reader is reminded that there is no "best" solution or system to be presented, as each application differs.

2 CAS ONLINE

The CAS ONLINE system is the first attempt by the Chemical Abstracts Service to provide its own databases directly to the user community. The CAS ONLINE system searches the very large (over 6 million) CAS Registry database of chemicals. Most of the CAS Registry file has structure representations, but there are a few hundred thousand chemicals which do not. Furthermore, polymers and coordination compounds have structure characteristics which set these classes of chemicals apart from the rest of the "classical" organic chemicals. The net result is that in CAS ONLINE (as well as in the DARC system) a search of the whole file cannot be done with one query. How important this point is depends on the question being asked. (A related question is how complete the database is: is coverage going back to 1965 sufficient for user needs? This point is being addressed by CAS in its latest venture to register chemicals from before 1965.) One important point to note when searching with this software is that the approach taken is to include all possible answers and to force the user to add qualifiers to limit the list of answers.

The chemical structure output capabilities of the CAS ONLINE system remain the best available in any of the six systems, although the MACCS output (prepared by hand input from the user) is very good. Graphics terminals are required to see the output; otherwise the printout must be mailed to the user. In the past year the CAS ONLINE system has added a new graphics input procedure. Most important has been the addition of abstracts to the CAS ONLINE system, which, by a policy decision, are not available on the other online systems (DARC/QUESTEL, Lockheed, SDC, ESA, DataStar, Infoline, and so forth). Searching of the abstracts is planned for the end of 1983. Lastly, CAS has announced a plan to add some chemical catalogs as searchable "databases", similar to the Fine Chemicals Directory (FCD), which is available on Infoline, and to a similar, but somewhat larger, project of chemical catalogs available on the NIH/EPA CIS SANSS.

3 CROSSBOW

The CROSSBOW system is the oldest and most stable of the six systems discussed here. It is also the only widely used system that is based on Wiswesser Line Notation (WLN), which is the primary reason it is chosen by the many companies that hold their in-house files in WLN. CROSSBOW is also the only system at present which has direct biology data searching, manipulation, and report generation built in as part of an overall complete system. (However, see the comments about the new ORGANIC system, which is also WLN based.) A similar chemistry/biology capability is to be available using the SANSS software, under a project of the US Government National Cancer Institute (NCI), sometime in 1984. Since CROSSBOW is WLN based, it cannot be readily linked to the vast CAS bibliographic literature. It can, however, be linked to the ISI bibliographic and structure (over 3 million chemicals) files, and this has been done at one private installation. (The ISI file will also be connection table searchable on the DARC system, described below.)

4 DARC

The DARC system has been under development for over a decade and a half, but was made commercially available only in the past few years. The DARC system is designed to search the same large CAS database of chemical structures as CAS ONLINE and, as such, is in direct competition with CAS ONLINE. The main improvement to the DARC system in the past year has been the much needed Boolean logic searching, which has improved the search capabilities of the system and often reduced costs for users. Offspring searching has also reduced users' costs. Saved queries and saved answer files, available on other systems, are now also available on DARC. The generic DARC search software, of considerable interest to industrial users doing Markush type structure searching for patents, is a valuable addition. The separate polymer file is a convenience for users.
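
Boolean searching over saved answer files amounts to set algebra on lists of compound identifiers. The sketch below is illustrative only: the answer-set labels and identifiers are invented, and it does not reproduce the actual DARC command language.

```python
# Hypothetical saved answer sets: label -> set of compound identifiers.
# Labels and identifiers are made up for illustration.
saved_answers = {
    "A1": {101, 102, 103, 104},   # e.g. hits for a ring substructure
    "A2": {103, 104, 105},        # e.g. hits for a substituent
    "A3": {104, 106},             # e.g. hits the user wants to exclude
}

def combine_and(a, b):
    """Compounds present in both answer sets."""
    return saved_answers[a] & saved_answers[b]

def combine_or(a, b):
    """Compounds present in either answer set."""
    return saved_answers[a] | saved_answers[b]

def combine_not(a, b):
    """Compounds in answer set a that are not in answer set b."""
    return saved_answers[a] - saved_answers[b]

both = combine_and("A1", "A2")
refined = both - saved_answers["A3"]
print(sorted(both))      # [103, 104]
print(sorted(refined))   # [103]
```

Because the combination works on stored answers rather than on the structure file, a user can narrow a result without paying for a new structure search, which is one plausible reading of why Boolean logic reduces costs.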

The DARC system uses inverted files (as opposed to the sequential searching of CAS ONLINE), which makes it more time consuming to update (hence it is updated monthly, versus weekly for CAS ONLINE) but also makes it a quicker and more highly interactive system than CAS ONLINE. The lack of chemical name output is a real problem with this system. The chemical drawing program of the DARC system is quite good, although its output is not as pleasing as that of CAS ONLINE. The cost of using DARC is somewhat lower than that of CAS ONLINE most of the time. The ease with which a user can enter a structure query in DARC gives this system a considerable advantage over CAS ONLINE (and all of the other systems) in learning to use the system.
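
The trade-off between inverted files and sequential searching can be illustrated with a toy file of fragment screens. This is a minimal sketch under assumed data: the screen labels, the compound numbers, and both function names are hypothetical and are not taken from DARC or CAS ONLINE.

```python
from collections import defaultdict

# Toy file: compound id -> set of fragment screens (labels are invented).
compounds = {
    1: {"benzene_ring", "carboxyl"},
    2: {"benzene_ring", "nitro"},
    3: {"carboxyl", "amine"},
    4: {"benzene_ring", "carboxyl", "nitro"},
}

def sequential_search(file, query):
    """Read every record and keep those containing all query screens."""
    return {cid for cid, screens in file.items() if query <= screens}

def build_inverted_file(file):
    """Map each screen to the set of compounds that contain it."""
    index = defaultdict(set)
    for cid, screens in file.items():
        for screen in screens:
            index[screen].add(cid)
    return index

def inverted_search(index, query):
    """Intersect the posting sets of the query screens.

    An empty query is treated as matching nothing in this sketch.
    """
    postings = [index.get(screen, set()) for screen in query]
    return set.intersection(*postings) if postings else set()

index = build_inverted_file(compounds)
query = {"benzene_ring", "carboxyl"}
hits = inverted_search(index, query)
print(sorted(hits))  # [1, 4]
```

The inverted search touches only the posting sets for the screens in the query, whereas the sequential search reads every record. Adding a compound, on the other hand, means updating one posting set per screen in the inverted file, which is consistent with the slower update cycle noted above.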

5 MACCS

The MACCS system, developed for in-house use with an internal company database, is highly graphics oriented. Owing to the dedicated nature of the MACCS computer system, searching goes rapidly on small files, and the molecular display and manipulation capabilities, found only here (and, in a less sophisticated form, in CHEMLAB, a software package related to the NIH/EPA CIS SANSS), are a very nice feature. Being the only connection table based system of the six designed for in-house use, it continues to enjoy a unique position in the marketplace. MACCS also has the ability to perform registration for in-house databases, which is an important need (although users must generally define their own conventions for tautomers, metal bonds, and so forth, leading to possible problems and inconsistencies with the software). One major issue causing concern with the MACCS system is the refusal of the vendor to provide the source code. Besides being of concern to management, the lack of source code makes it impossible to connect MACCS with in-house data and systems without vendor assistance. The additional features (available for a price), such as reaction indexing, molecular modeling, and structure activity software, continue to add to the attractiveness of this system.

6 ORGANIC

A recent joint venture between Sumitomo Chemical Company and Nippon Electric Company of Japan has produced a new structure search and display system called ORGANIC. It is a conversational substructure search system based on WLN. The database, created by the user, can be searched by WLN, by a fragment code of some 120 codes, and by connection table. The vendor has stated that the system can be used for quantitative structure activity relationship (QSAR) studies when pharmacological effect data are added to the system. The system cannot handle polymers, chiral chemicals, undefined and variable composition materials, and the other classes of materials for which WLN has limitations. The system runs in both batch and interactive modes, with answers in English and Japanese. Hopefully further details about this system will become available in the near future.

7 NIH/EPA CIS SANSS

The last system to be discussed is the NIH/EPA Chemical Information System (CIS) SANSS (Structure and Nomenclature Search System). This system is part of a larger US Government sponsored project which contains the largest collection of publicly available numeric databases in the fields of spectroscopy, toxicology, chemical regulations, and environmental information. The SANSS system is the most versatile of all the systems described herein, in that one is able to search by chemical structure, fragment code, ring code, chemical names, partial and complete molecular formula, molecular weight and full structures. The SANSS system, however, has only some 253,000 chemical structures and 805,000 chemical names, which means it contains about 4% of the DARC or CAS ONLINE structures and about 10% of the names in the full CAS database. The 253,000 chemicals have been assembled from collections of lists of interest to the users of the system, including lists of interest outside the USA, such as the EEC Chemical Inventory. This Chemical Locator Function (CLF) of SANSS is unique among the six systems, although it appears it will be imitated, to some degree, in the near future. Recent cooperation between Government agencies is resulting in some much needed documentation for the system, improved searching capabilities, and a complete "biology search and display" system, which should prove attractive to many users. This software appears to be still 3-6 months away from being available.
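
Structure and substructure searches on connection table systems come down to atom-by-atom matching of a query graph against each candidate, usually after fragment screens have removed most of the file. The following is a minimal sketch of that matching step using simple backtracking. The data layout (a list of element symbols plus a dictionary of bond orders, hydrogens omitted) and the function name are assumptions made for illustration; they are not the storage format or algorithm of SANSS or of any other system discussed here.

```python
def is_substructure(query, target):
    """Return True if the query atoms and bonds can be mapped onto the target.

    Each molecule is a pair (atoms, bonds): atoms is a list of element
    symbols, and bonds maps frozenset({i, j}) to a bond order.
    """
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target

    def extend(mapping):
        # mapping[k] is the target atom assigned to query atom k
        i = len(mapping)                      # next query atom to place
        if i == len(q_atoms):
            return True                       # every query atom is placed
        for j, symbol in enumerate(t_atoms):
            if symbol != q_atoms[i] or j in mapping:
                continue
            # Each query bond from atom i to an already placed atom k
            # must exist in the target with the same bond order.
            ok = all(
                t_bonds.get(frozenset({j, mapping[k]})) == q_bonds[frozenset({i, k})]
                for k in range(i)
                if frozenset({i, k}) in q_bonds
            )
            if ok and extend(mapping + [j]):
                return True
        return False                          # backtrack

    return extend([])

# Hydrogen-suppressed toy connection tables.
ethanol = (["C", "C", "O"], {frozenset({0, 1}): 1, frozenset({1, 2}): 1})
acetone = (["C", "C", "C", "O"],
           {frozenset({0, 1}): 1, frozenset({1, 2}): 1, frozenset({1, 3}): 2})
c_single_o = (["C", "O"], {frozenset({0, 1}): 1})
c_double_o = (["C", "O"], {frozenset({0, 1}): 2})

print(is_substructure(c_single_o, ethanol))   # True
print(is_substructure(c_single_o, acetone))   # False
print(is_substructure(c_double_o, acetone))   # True
```

An identity (exact structure) search can be viewed as the special case in which the query and the target also have the same number of atoms and bonds.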

The SANSS system does provide for teletype as well as graphic output of chemical structures, but these are not very good in comparison with the structures which are output by the CAS ONLINE, DARC and MACCS systems. In addition, the lack of an automatic search in SANSS makes it necessary to carefully construct and execute a series of queries. (This feature will be available as part of the new software improvements being undertaken.) Lastly, the SANSS system does allow for complete, partial, left and right truncated name searching with display of all available names contained in the system, which can number in excess of 700 names for a chemical. This system is much less expensive than CAS ONLINE or DARC and may be useful as a first pass search for a user before spending over $100 to search the large CAS database.
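
The four name search modes mentioned above can be sketched as simple string tests. The convention assumed here is the usual one in online retrieval (right truncation matches names that begin with the term, left truncation matches names that end with it); the function name, mode labels, and sample names are hypothetical and do not reproduce the SANSS command syntax.

```python
def name_search(names, term, mode="complete"):
    """Return the names matching term under the given search mode.

    Modes (assumed convention):
      complete - the whole name equals the term
      partial  - the term occurs anywhere in the name
      right    - right truncation: the name begins with the term
      left     - left truncation: the name ends with the term
    """
    term = term.lower()
    tests = {
        "complete": lambda n: n == term,
        "partial": lambda n: term in n,
        "right": lambda n: n.startswith(term),
        "left": lambda n: n.endswith(term),
    }
    if mode not in tests:
        raise ValueError(f"unknown search mode: {mode}")
    match = tests[mode]
    return [n for n in names if match(n.lower())]

names = ["Benzene", "Chlorobenzene", "Benzenesulfonic acid", "Toluene"]
print(name_search(names, "benzene", "complete"))  # ['Benzene']
print(name_search(names, "benzene", "right"))     # ['Benzene', 'Benzenesulfonic acid']
print(name_search(names, "benzene", "left"))      # ['Benzene', 'Chlorobenzene']
print(name_search(names, "benzene", "partial"))
```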

8 SUMMARY

The six systems discussed above, whose features (for five of the six) are outlined in the following table, have many unique and many overlapping characteristics. For the user to decide which system or systems will best meet their needs, it is strongly suggested that a detailed analysis of the problem be undertaken, coupled with a number of sample searches to see what, in fact, the results are for a particular problem. Just seeing a demonstration, or asking existing users if a system is "good", is not the way to approach such a complex and valuable tool. Recently a workshop was held in the USA on chemical structure searching, and interested readers are advised to obtain a copy of the workshop study, which has excellent questions asked of a number of systems and a valuable analysis of a number of the systems (CAS ONLINE, DARC and SANSS) described here (Ref. 1).

This field is continuing to develop, albeit more slowly than in the past 2-3 years. It still appears likely that the competition between systems and vendors seen in the past few years will continue to improve these systems and provide the best overall service to the scientific and information user community which these systems were designed to serve.

9 REFERENCES

1. "Structure Searching Workshop Proceedings" is available from CRC Systems, Inc., 4020 Williamsburg Court, Fairfax, Virginia 22032 USA (703-385-0440).

Table: Summary of Survey of Five Structure/Substructure Search Systems

| Feature/Item | CAS ONLINE | CROSSBOW | DARC | MACCS | SANSS |
| --- | --- | --- | --- | --- | --- |
| First Available or Operational | 11/80 | 1969 | 1/80 | 1979 | 5/77 |
| Number of User Organizations | 800 | 25 | 800 | 30 | 500 |
| Costs: | | | | | |
| Software Purchase Price ($K) | N/A | 61 | 25-55 | 115+ (1) | 2 (2) |
| Online Use - Yearly Subscription Fee | $0 | N/A | $0 | N/A | $300 |
| Online Use - Approximate Hourly Fee | $180 | N/A | $150 | N/A | $85 |
| Online Use - Hit Fee | $0.10 - $1.00 | N/A | $0.14 - $0.28 | N/A | $0 |
| Online Use - Search Fee/Cost | $80 | N/A | $72 | N/A | $15 (3) |
| Database Size: | | | | | |
| # Chemicals (7/83) | 6,300,000 | 10,000 - 3,000,000 | 6,300,000 | 32,000 (4) | 255,000 |
| # Chemicals Expected (12/83) | 6,500,000 | same | 6,500,000 (CAS File); 3,000,000 (ISI File) | N/A | 300,000 |
| Output/Input: | | | | | |
| a) Structure Display - Teletype | No | Yes | No | No | Yes |
| a) Structure Display - Graphics (Vectors) | Yes | No | Yes | Yes | Yes |
| b) Structure Input - Teletype | Yes | No | Yes | Yes | Yes |
| b) Structure Input - Graphics (Vectors) | Yes | No | Yes | Yes | No |
| b) Structure Input - WLN | No | Yes | Yes (5) | No | No |
| c) Maximum # Names | 50 | None | ? | One | None |
| Administrative: | | | | | |
| Manuals - English | Yes | Yes | Yes | Yes | Yes |
| Manuals - Other | No | Japanese | French | No | Japanese |
| Training | Yes | Yes | Yes | Yes | Yes |
| Training Cost | $250 (6) | ? | $50 - $250 | ? | $100 |
| Training Required | Yes | ? | No | ? | No |
| US Toll Free 800 Hotline | Yes | No | Yes | No | Yes |
| Linking: | | | | | |
| Bibliographic Information | Yes | Yes | Yes | Yes | No (7) |
| Non-Bibliographic Information | No | Yes | Yes | Yes | Yes |
| Private Files | Yes | Yes | Yes | Yes | Yes |
| Hardware | IBM 3081 | PDP-11, IBM/370, DEC-10/20, Prime-400, ICL-1900, VAX | Burroughs-6700, Honeywell, IRIS-80 | Prime, VAX, IBM (coming) | DEC-10/20 |
| Software | Assembler & PL1 | Cobol & Assembler | Fortran | Fortran | Fortran |
| Batch (B) or Interactive (I) | B (8) | B | I | I | I |
| Structure Data Source(s): | | | | | |
| CAS Connection Table | No | Yes | No | Yes | Yes |
| Other Connection Table | No | Yes | No | Yes | Yes |
| WLN | No | Yes | Yes (9) | No (10) | No (10) |
| Fragment Codes | Modified Swiss Screens | Crossbow Fragments | DARC Screens | MACCS Screens | CIDS Screens & CIS Ring Fragment Screens |
| Searching Capabilities: | | | | | |
| WLN | No | Yes | Yes (9) | No | No |
| Fragment Code | Yes | Yes | No | Yes | Yes |
| Molecular Formula | No | Yes | Yes (9) | Yes (11) | Yes |
| Name(s) | Yes (12) | No | No | Yes (12) | Yes |
| Identity Search (Exact Structure) | Yes | Yes | Yes | Yes | Yes |
| Atom-by-atom | Yes | Yes | Yes | Yes | Yes |
| CAS RN | Yes | No | Yes | Yes | Yes |


N/A = Not available.
? = Not available or unknown.
(1) + indicates that additional features are extra; the cost of a VAX or Prime computer should be added, if needed.
(2) Only software provided for this price. Installation costs, if any, are extra.
(3) For CAS ONLINE & DARC this is a fixed fee/search; the figure of $15 for SANSS represents the charge for a typical search
(4) Public demonstration file. In-house installations can handle larger files.
(5) Available on ISI database
(6) Fee includes equivalent amount of usage
(7) Indirect links provided to Lockheed, SDC, and NLM.
(8) Online, fast batch
(9) Available on ISI file only.
(10) No, but can be converted for use in the system.
(11) No dot disconnects.
(12) Exact Name only - no delimiters.
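
As a rough worked example of how the online fees in the table combine, the sketch below estimates the cost of a single session on CAS ONLINE and DARC. The fee figures are taken from the table; the additive charging model (connect time plus fixed search fee plus per-hit fee) and the session parameters are assumptions made for illustration, not the vendors' actual billing rules. SANSS is omitted because its $15 figure is a typical search charge rather than a fixed fee (note 3).

```python
# Fee figures are copied from the survey table (US dollars, 1983).
# The additive charging model below is an assumption for illustration.
FEES = {
    "CAS ONLINE": {"hourly": 180.0, "search": 80.0, "hit": (0.10, 1.00)},
    "DARC": {"hourly": 150.0, "search": 72.0, "hit": (0.14, 0.28)},
}

def session_cost(system, hours, searches, hits):
    """Return (low, high) estimated cost in dollars for one online session.

    The low and high figures use the low and high ends of the hit fee range.
    """
    fee = FEES[system]
    base = hours * fee["hourly"] + searches * fee["search"]
    low_hit, high_hit = fee["hit"]
    return (round(base + hits * low_hit, 2), round(base + hits * high_hit, 2))

# Hypothetical session: 15 minutes connect time, one search, 20 hits.
for name in FEES:
    print(name, session_cost(name, hours=0.25, searches=1, hits=20))
```

Under these assumptions a single short session on either large system exceeds $100, which is in line with the figure quoted in the SANSS section.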