The USDA/ARS/NAL Plant Genome Information System

A 3rd Year Status Report

Stephen R. Heller*
USDA, ARS, Beltsville, MD 20705-2350 USA
SRHELLER@ASRR.ARSUSDA.GOV

Jerome P. Miksche
USDA, ARS, Beltsville, MD 20705-2350 USA
JMIKSCHE@ASRR.ARSUSDA.GOV

and
Douglas Bigwood
USDA, NAL, Beltsville, MD 20705-2351 USA
DBIGWOOD@NALUSDA.GOV

* Author to whom correspondence should be addressed

ABSTRACT

The goal of the USDA Plant Genome Research Program (PGRP) is to improve plants (agronomic, horticultural, and forest species) by locating marker genes on chromosomes, determining gene structure, and transferring genes with the capability of improving performance with reduced environmental impact to meet marketplace needs and niches. To support this goal, the USDA Plant Genome Database (PGD) project was started in 1991. The prototype database now under development will include corn, soybean, wheat, Arabidopsis, and forest tree species. The current objectives of this ARS project are: a) to initiate beta-testing of the prototype common database on the genetics of these species and work in a cooperative effort to establish a generic database for agronomic plants, and b) to engender a "grass roots" support base for the genome database efforts, i.e., the scientists are participating in the development of the database, in contrast to a "top down" approach. Funding is being provided to labs at Ames, IA (soybean), Columbia, MO (corn), Albany, CA (wheat, forest trees), and Boston, MA (Arabidopsis). This presentation will describe the database creation and provide examples of the currently functioning database.

INTRODUCTION

The goal of the USDA, ARS Plant Genome Research Program (PGRP) is to improve plants (agronomic, horticultural, and forest species) by locating marker genes on chromosomes, determining gene structure, and transferring genes with the capability of improving performance with reduced environmental impact to meet marketplace needs and niches.

Several years passed between the initial conception of this program and 1991, when the US Congress initiated funding for it. A short summary of the history and immediate future of the plant genome program is given in Table 1.

TABLE 1

PLANT GENOME PROGRAM

1987 NIH/DOE Human Genome Project established

1988 Plant Genome project proposed by J. Miksche, ARS
Asst. Secretary Bentley (USDA) endorses ARS to lead Plant Genome project (10/88)
Crop & Forest Genome Mapping Conference held in Washington DC (12/88)

1989 J. Miksche appointed Director of USDA, ARS Plant Genome Office (4/89)
Interagency Plant Genome Coordinating Committee established & meets (5/89 & 8/89)

1990 ARS given $99,000 "seed money" for Plant Genome planning activities
S. Heller assigned to project and given responsibility for Plant Genome informatics activities (9/90)

1991 ARS receives $2.9 million additional funds for Plant Genome project
CSRS/National Research Initiative (NRI) receives $10.0 million for Plant Genome mapping activities

Analysis of genomic research at ARS, land grant schools, industry, and foreign groups

Plant Genome Information center established at NAL

Funding areas: Database, AGRICOLA enhancements, Collections, Newsletter, Reference Works, Books

Funding for mapping & data collection/evaluation activities dispersed to ARS labs (1-3/91)

Database analysis & initial system design (2-6/91)

1992 ARS receives $2.9 million for Plant Genome project

CSRS/NRI receives $12.2 million for Plant Genome mapping activities. Request for Proposals (RFP) for Plant Genome activities published in the US Federal Register.

First International Plant Genome Meeting held (PG I) with 415 attendees (11/92)

Initiate examination of gene traits ready for delivery

1993 Beta-testing of NAL Plant Genome Database (PGD) (6/93)

1994-1995 Public release of operational NAL Plant Genome Database (PGD)

Second International Plant Genome Meeting to be held (PG II) (1/94)

Initiate plans to add additional species (e.g., pea, sorghum, cotton, peanut, rice, lettuce, and tomato) into PGD

Initiate plans for satellite nodes of PGD with groups in Europe (EC) and Asia

This paper is concerned with the information and database aspects of the program. The Plant Genome Database (PGD) consists of a number of components. Broadly speaking, the master database is divided into three parts: the stock center databases, the mapping databases, and the sequence database.

One of the unique features of the plant genome project, as opposed to the human genome project, is the ability to experiment on and breed the species being studied. While man has been breeding plants and animals for centuries, the ability to perform such experiments in a more scientific manner is one of the primary benefits expected from this work. The stock centers, located in the USA and elsewhere, will be of utmost practical importance to this project in the future. A database called GRIN, the Germplasm Resources Information Network (1), has already been developed. While it is not directly part of the Plant Genome Research Program, it will provide a valuable link between the germplasm and the genetic information from the other two database systems.

The second area comprises the various mapping databases, which cover physical, genetic, RFLP, and other maps of species. These databases are being developed in a coordinated manner for the first time under the direction of this project. At present ARS is funding five prototype mapping efforts. These are:

Corn (Maize)
Wheat
Soybean
Pine tree
Arabidopsis

The last of these, while being effectively a weed, is a good model plant system, with a relatively small genome (about 70 million base pairs).

The third and last area consists of the actual sequences of the base pairs found in the DNA. Since these data are the same as the data going into the GenBank/GenInfo/European Molecular Biology Laboratory (EMBL) databases (2), it was felt that there was no need to develop a plant sequence database independently. Hence ARS is putting all its sequence data into this one universal sequence database.

Thus, two of the three database activities rely on existing, ongoing efforts, which was felt to be the appropriate course of action from both a management and a scientific point of view. This leaves "only" the mapping databases for ARS to be concerned with at this time. The approach ARS has taken is rather simple. ARS is providing funding for each of the five species listed above and asking a particular lab to take the lead in coordinating all of the data that will go into the mapping database for that species. Coordination includes obtaining the data (or running the necessary experiments if no data are available), performing some evaluation and quality control on the data, and then putting the species-specific data into a local database system at the coordinator's home lab or institution.

The resulting public database is then sent to the National Agricultural Library (NAL), where it is integrated into a master database system with data from all crop species (3). The current NAL effort uses the Sybase relational database management system (DBMS) software, which is the same software used by the human genome project researchers. In the future, other database management systems based on object-oriented data structures will be explored. At present, the resulting integrated relational database is being made available, via Internet, to scientists from all over the world (4). In this way, the various coordinators are spared the burden of running a service operation, something in which they are generally not interested and, most importantly, for which they have neither the qualifications nor the experience. NAL, on the other hand, is in the business of service and support, and has the proper management philosophy towards the future of libraries, which includes a heavy emphasis on electronic distribution of information. In addition to the integrated map data from the five species, the NAL system makes the full text of the AGRICOLA-related literature references searchable as part of the system. All the data from the five databases are also full-text searchable, providing the user with a very powerful and complete system.
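The integration step described above, in which each coordinating lab's species database is loaded into one master relational database, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: Python's built-in sqlite3 module stands in for the Sybase DBMS actually used at NAL, and the table name, column names, and sample rows are hypothetical rather than taken from the PGD design.

```python
import sqlite3

# Hypothetical per-species exports, standing in for the databases that the
# coordinating labs send to NAL. The map names and types are made up.
SPECIES_EXPORTS = {
    "corn": [("map-a", "genetic"), ("map-b", "RFLP")],
    "wheat": [("map-c", "physical")],
}


def build_master(exports):
    """Load every species export into one master table, tagging each row with its species."""
    conn = sqlite3.connect(":memory:")  # sqlite3 is a stand-in for Sybase
    conn.execute(
        "CREATE TABLE map ("
        " species TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " map_type TEXT NOT NULL,"
        " PRIMARY KEY (species, name))"
    )
    for species, rows in exports.items():
        conn.executemany(
            "INSERT INTO map (species, name, map_type) VALUES (?, ?, ?)",
            [(species, name, map_type) for name, map_type in rows],
        )
    conn.commit()
    return conn


def maps_per_species(conn):
    """Return a {species: number of maps} summary drawn from the master table."""
    return dict(conn.execute("SELECT species, COUNT(*) FROM map GROUP BY species"))
```

Once the rows sit in a single table, a cross-species question such as "how many maps does each species have" becomes one query against the master database instead of a separate request to each coordinator.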

Besides the online system available via Internet, NAL will also disseminate the database on tape for those wishing to create their own systems. Lastly, it is expected that there will be a CD-ROM version of the database for those who prefer that medium.

In addition to its service role, NAL will perform another important function: it will be responsible for assuring that all of the databases coming from the mapping groups are consistent and standardized to the maximum extent possible.

For example, standards are being established or adopted by NAL for gene nomenclature, literature citations, and terminology.

Prototype Database Development for Corn, Soybean, Wheat, Arabidopsis, and Forest Tree Species

The assigned objectives of this ARS-managed project are: a) to initiate beta-testing of the prototype generic database on the genetics of these species and work in a cooperative effort to establish a generic database for agronomic plants, and b) to engender a "grass roots" support base for the genome database efforts, i.e., the scientists are participating in the development of the database, in contrast to a "top down" approach. Funding was provided to locations at Ames, Iowa (soybean), Columbia, Missouri (corn), Albany, California (wheat, forest trees), and Harvard (Arabidopsis). Under the present structure of the database program, data come from sources beyond the five laboratories mentioned above and are sent to the NAL Plant Genome Database over the Internet.

Database design and related information were agreed to this year by all participating groups. A demonstration of the system was given at the Plant Genome I meeting, and the acceptance and attendance were overwhelming. A larger and more comprehensive demonstration will be given at Plant Genome II in January 1994 (5).

Priority database topics, as determined by ARS in cooperation with scientists from the public and private sectors, include Disease/Pathology, Genetic Resources, Germplasm, Genetic Maps, Metabolic Pathways, Organelles, Quantitative Traits, and Database Quality Control. Groups continue to explore specific user needs for each of these topics. To facilitate and expand data collection, assimilation, and database definition for some of these topics, specific cooperative agreements were arranged with researchers working on genetic mapping (RFLP) and other mapping procedures, nitrogen metabolism, oil and fatty acid biosynthesis, genetic collections, evolution, and software development. The progress made and problems encountered under these priority topics and cooperative agreements have been used to better address the specific database problem areas assigned. Additional specific cooperative agreements with various agricultural research labs around the USA and in foreign countries are being drafted for database work dealing with diseases, pathology, and molecular mapping. Agreements between ARS and the Computing Sciences Division, Lawrence Berkeley Laboratory, Berkeley, CA, and Yale University for the design and implementation of the representative species genome database have also been established. The net result of these cooperative projects has been a carefully planned and well agreed-upon database and access system. This should help assure its success in the years to come, as the system grows and scientists in agricultural labs throughout the world make it part of their everyday research activities.

Centralization of Database Activities at the National Agricultural Library

NAL has established the Plant Genome Data and Information Center (PGC) in support of the Plant Genome Program with an allocation of about $1 million per year from ARS. The major accomplishments which this funding has provided can be divided into three areas: the Plant Genome Database, Information Dissemination, and Bibliographic Materials Enhancement.

Plant Genome Database (PGD)

The initial design of the Plant Genome Database (PGD) is complete. It already includes phenotypic trait and germplasm data. Metabolic data remain a design issue. More raw data will be placed in the database. The design is generic; no species-specific design elements exist at present. This will ensure that the database easily accommodates future expansion to other species without major design changes. However, data input and further developments are necessary. The database is now in beta-test and will be released in 1994. Its current contents are shown in Table 2.
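One way to picture what a generic, non-species-specific design means in relational terms is a schema in which species is stored as data rather than built into the table structure. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Sybase, and the two tables, their columns, and the sample trait are hypothetical, not the actual PGD schema.

```python
import sqlite3

# Hypothetical generic schema: no table or column is named after a species.
SCHEMA = """
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    common_name TEXT UNIQUE NOT NULL
);
CREATE TABLE trait (
    id INTEGER PRIMARY KEY,
    species_id INTEGER NOT NULL REFERENCES species(id),
    name TEXT NOT NULL
);
"""


def open_generic_db():
    """Create an empty in-memory database with the generic schema."""
    conn = sqlite3.connect(":memory:")  # sqlite3 is a stand-in for Sybase
    conn.executescript(SCHEMA)
    return conn


def add_species(conn, common_name):
    """Register a species as a row of data and return its id."""
    cur = conn.execute("INSERT INTO species (common_name) VALUES (?)", (common_name,))
    return cur.lastrowid


def add_trait(conn, species_id, name):
    """Attach a trait record to an already registered species."""
    conn.execute("INSERT INTO trait (species_id, name) VALUES (?, ?)", (species_id, name))


def table_names(conn):
    """List the tables in the database, to show that the structure is unchanged."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [row[0] for row in rows]
```

Under this kind of design, adding a further species such as rice is an ordinary insert into the species table, and the list of tables is the same before and after.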

Table 2

November 1993 Status of the PGD

3.3 Megabytes of Data
3.3 Million records
300,000 References
7,200 Sites
108 Maps
4,430 Allele Variations
21,000 Stocks
2,593 Traits
340,000 Phenotypes

Existing databases are being tied into the Plant Genome Database: GenBank and PIR are already linked, and GRIN will be. This integration is essential to guarantee the maximum utility of the data while minimizing duplication of effort. It centralizes agricultural information and ties it to the National Institutes of Health Human Genome Database and the database of the EMBL.

Information Dissemination

NAL has established an information center as part of the PGC that is responsible for disseminating genomic information to the public. The Center responds to requests for specific types of information, reviews all available sources, and reports back to the requestor with the findings. An outreach program has been developed to bring the Plant Genome Program to the scientific community at major scientific conferences and research locations. An example of this is Probe (6), the quarterly newsletter of the Plant Genome Program. Probe contains updates about the progress of the program as well as articles from some of the top researchers in genetics; national and international circulation for the first issue exceeded 6,000. The Center also develops technical, subject-oriented publications. These publications include directories of experts, bibliographies related to methodologies and software, and listings of sequenced genes. Other information products will be produced to meet the information needs of the program and the scientific community.

Bibliographic Materials Enhancement

NAL's AGRICOLA bibliographic database system has been improved in three fundamental ways: first, the focus of the system has been altered to expand coverage of genetic information sources; second, the quality of the bibliographic records has been enhanced by adding extensive abstracts; and third, new keywords, which identify specific records containing genomic data, have been added. All of these enhancements greatly improve a user's ability to retrieve relevant information. This is especially true when one considers that these enhancements are being linked with the other data in the Plant Genome Database (PGD).

Examples of results from the Plant Genome Database

Providing examples of interactive computer systems for a journal paper is really an impossible task, so only "snapshots" of the system will be given. It should be mentioned that this problem has been overcome in oral presentations in many modern lecture halls by using a computer system connected to the projection screen or by using a previously prepared video tape (in the proper NTSC, PAL, or SECAM format).

Figure 1 shows the main menu and Figures 2a-c show the results from a real search (for all stock names beginning with the letters "ch"). Figures 3a-b show the results from a reference search. Any of these figures (screen dumps from the computer screen) can be sent via electronic mail to anyone who has an e-mail address on the Internet. For explanations of each figure, please refer to the text beneath the figure or read the figure captions listed after the references at the end of this paper.
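For readers without access to the figures, the stock-name search can be approximated by a simple prefix query. The following is a minimal sketch, not the PGD implementation: Python's built-in sqlite3 module stands in for Sybase, the stock table and its name column are hypothetical, and apart from "Chinese Spring" (which appears in Figure 2b) the stock names are placeholders chosen for illustration.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stock table with placeholder names."""
    conn = sqlite3.connect(":memory:")  # sqlite3 is a stand-in for Sybase
    conn.execute("CREATE TABLE stock (name TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO stock (name) VALUES (?)",
        [("Chinese Spring",), ("Cheyenne",), ("Atlas",)],
    )
    return conn


def find_stocks_by_prefix(conn, prefix):
    """Return stock names that begin with the given prefix, in sorted order."""
    # Escape LIKE wildcards so that "%" and "_" typed by the user match literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT name FROM stock WHERE name LIKE ? ESCAPE '\\' ORDER BY name",
        (escaped + "%",),
    )
    return [row[0] for row in rows]
```

In SQLite the LIKE operator ignores case for ASCII letters by default, so a search for "ch" on this demo table finds both "Cheyenne" and "Chinese Spring". Whether the real system treats case the same way is not stated in this paper.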

CONCLUSIONS

The Plant Genome Database (PGD) is now a real and functioning information and data resource for agricultural genome researchers. As the system increases in size and intellectual content, its value will greatly increase and enhance the abilities of researchers to undertake more sophisticated genome research, which will ultimately benefit the farming community and all consumers.

REFERENCES

1. a) J. D. Mowder and A. K. Stoner, "Information Systems", Plant Breeding Review, 7, pages 57-65 (1989); b) for further details and an information brochure, contact GRIN, the Germplasm Resources Information Network, Database Manager, USDA, ARS, Bldg. 003, Beltsville MD 20705-2350. Phone: 1-301-504-5666; FAX: 1-301-504-5536.

2. See Science, 22 October 1993, pages 502-505.

3. For further details contact the Plant Genome Information Center, USDA, NAL, 10301 Baltimore Blvd., Beltsville, Maryland 20705-2351, USA. Phone: 1-301-504-6875; FAX: 1-301-504-7098 or e-mail to SMCCARTHY@ASRR.NALUSDA.GOV.

4. Access to the database is available via Internet as follows:
a) telnet to PROBE.NALUSDA.GOV; b) login as "PGD" with "GENOME" as the password; c) follow the instructions; d) fill out the online registration form if you have not done so. If you are having any problems with access to the system, please send an e-mail message to: PGENOME@NALUSDA.GOV.

5. For information on the Plant Genome meeting series please contact: Scherago International, Inc., 11 Penn Plaza, Suite 1003, New York, NY 10001 (Phone: 212-643-1750; FAX: 212-643-1758).

6. Probe, the Newsletter for the USDA, Plant Genome Research Program, Order from USDA, NAL, 10301 Baltimore Blvd., Beltsville, Maryland 20705-2351, USA. Phone: 1-301-504-6875; FAX: 1-301-504-7098 or e-mail to SMCCARTHY@ASRR.NALUSDA.GOV.

Captions for Figures:

1. A computer screen dump of the Main Menu for PGD (note that Sites are all Mapped objects). This, and all the other figures are screen dumps from a Sun UNIX workstation.

2a. A computer screen dump showing the results of a search for all stock names beginning with the letters "ch".

2b. The Detail Screen for the stock name "Chinese Spring" (selected from the choices in Figure 2a).

2c. The Detail Screen of a collection (ARS - National Small Grain Collection) where the stock Chinese Spring can be found. This screen was brought up by going to "Stock" from the Main Menu and searching for "Chinese Spring". Then the Collection option was selected from the Go To menu.

3a. A computer screen showing the results of a search for all maps for all species. [This screen was reached by choosing the map option off of the Main Menu.]

3b. This computer screen is the detail of a reference record retrieved from the search done in Figure 3a. Note that the entire abstract exists and can be viewed by the user, but that it will not all fit on the computer screen at the same time. However, the user can scroll through the abstract to view all the pages.