* Author to whom correspondence should be addressed
The goal of the USDA Plant Genome Research
Program (PGRP) is to improve plants (agronomic, horticultural,
and forest species) by locating marker genes on chromosomes,
determining gene structure, and transferring genes with the
capability of improving performance with reduced environmental
impact to meet marketplace needs and niches. To support this
goal the USDA Plant Genome Database (PGD) project was started in
1991. The prototype database now under development will include
corn, soybean, wheat, Arabidopsis, and forest tree species.
The current objectives of this ARS project are: a) initiate beta-testing of the prototype common database on the genetics of these
species and work in a cooperative effort to establish a generic
database for agronomic plants, and b) engender a "grass roots"
support base for the genome database efforts, i.e., the
scientists are participating in the development of the database
in contrast to a "top down" approach. Funding is being provided
to labs at Ames, IA (soybean), Columbia, MO (corn), Albany, CA
(wheat, forest trees), and Boston, MA (Arabidopsis). This
presentation will describe the database creation and provide
examples of the currently functioning database.
INTRODUCTION
The goal of the USDA, ARS Plant Genome Research Program
(PGRP) is to improve plants (agronomic, horticultural, and forest
species) by locating marker genes on chromosomes, determining
gene structure, and transferring genes with the capability of
improving performance with reduced environmental impact to meet
marketplace needs and niches.
Several years passed between the initial conception of this
program and 1991, when the US Congress initiated funding for
it. A short summary of the history and immediate
future of the plant genome program is given in Table 1.
Table 1. History and immediate future of the plant genome program

1987       NIH/DOE Human Genome Project established
1988       Plant Genome project proposed by J. Miksche, ARS
           Asst. Secretary Bentley (USDA) endorses ARS to lead Plant Genome project (10/88)
           Crop & Forest Genome Mapping Conference held in Washington DC (12/88)
1989       J. Miksche appointed Director of USDA, ARS Plant Genome Office (4/89)
           Interagency Plant Genome Coordinating Committee established & meets (5/89 & 8/89)
1990       ARS given $99,000 "seed money" for Plant Genome planning activities
           S. Heller assigned to project and given responsibility for Plant Genome informatics activities (9/90)
1991       ARS receives $2.9 million additional funds for Plant Genome project
           CSRS/National Research Initiative (NRI) receives $10.0 million for Plant Genome mapping activities
           Analysis of genomic research at ARS, land grant schools, industry, and foreign groups
           Plant Genome Information center established at NAL
           Funding areas: Database, AGRICOLA enhancements, Collections, Newsletter, Reference Works, Books
           Funding for mapping & data collection/evaluation activities disbursed to ARS labs (1-3/91)
           Database analysis & initial system design (2-6/91)
1992       ARS receives $2.9 million for Plant Genome project
           CSRS/NRI receives $12.2 million for Plant Genome mapping activities
           Request for Proposals (RFP) for Plant Genome activities published in the US Federal Register
           First International Plant Genome Meeting (PG I) held with 415 attendees (11/92)
           Initiate examination of gene traits ready for delivery
1993       Beta-testing of NAL Plant Genome Database (PGD) (6/93)
1994-1995  Public release of operational NAL Plant Genome Database (PGD)
           Second International Plant Genome Meeting (PG II) to be held (1/94)
           Initiate plans to add additional species (e.g., pea, sorghum, cotton, peanut, rice, lettuce, and tomato) into PGD
           Initiate plans for satellite nodes of PGD with groups in Europe (EC) and Asia
This paper is concerned with the information and database
aspects of the program. The Plant Genome Database (PGD)
consists of a number of components. Broadly speaking the master
database is divided into three parts. They are the stock center
databases, the mapping data databases, and the sequence database.
One of the unique features of the plant genome project, as
opposed to the human genome project, is the ability to experiment
and breed species. While man has been breeding plants and
animals for centuries, the ability to perform such experiments in
a more scientific manner is one of the primary benefits expected
from this work. The stock centers, located in the USA and
elsewhere, will be of utmost practical importance to this project
in the future. At present there is a database, the Germplasm
Resources Information Network (GRIN) (1). While not directly part
of the Plant Genome Research Program, it will provide a valuable
link between the germplasm and the genetic information from the
other two database systems.
The second area comprises the various mapping databases. These
include physical, genetic, RFLP, and other maps of species. The
databases are being developed in a coordinated manner for the
first time under the direction of this project. At present ARS is
funding five prototype mapping efforts. These are:
Corn (Maize)
Wheat
Soybean
Pine tree
Arabidopsis
The last of these, while being effectively a weed, is a good
model plant system, with a relatively small genome (about 70
million base pairs).
The third and last area consists of the actual sequences of the base pairs found in the DNA. Since these data are the same as the data going into the GenBank/GenInfo/European Molecular Biology Laboratory (EMBL) databases (2), it was felt that there was no need to develop a plant sequence database independently. Hence ARS is putting all its sequence data into this one universal sequence database.
Thus, two of the three database activities rely on existing,
ongoing efforts. This was felt to be the appropriate course of
action from both a management and a scientific point of view. It
leaves "only" the mapping databases for ARS to be concerned with
at this time. The approach ARS has taken is rather simple. ARS is
providing funding for each of the five species listed above and
asking a particular lab to take the lead in coordinating all of
the data which would go into the mapping database associated with
that species. Coordination includes obtaining the data (or running
the necessary experiments if no data are available), performing
some evaluation and quality control on the data, and then putting
the species-specific data into a local database system at the
coordinator's home lab or institution.
The resulting public database is then sent to the US
National Agricultural Library (NAL), where it is integrated into a
master database system with data from all crop species (3). The
current NAL effort uses the Sybase relational database management
system (DBMS) software, which is the same software used by the
human genome project researchers. In the future, other database
management systems based on object-oriented data structures will
be explored. At present, the resulting integrated relational
database is being made available, via Internet, to scientists
from all over the world (4). In this way, the coordinators are
spared the burden of running a service operation, which generally
does not interest them and, most importantly, for which they are
not qualified or experienced. The NAL, on the other hand, is in
the business of service and support, and has the proper management
philosophy towards the future of libraries, which includes a
heavy emphasis on electronic distribution of information. In
addition to the integrated map data from the five species, the
NAL system has the full text of the AGRICOLA related literature
references searchable as part of the system capabilities. All
the data from the five databases are also full-text searchable,
providing the user with a very powerful and complete system.
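The integration step described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for the Sybase system. The table layout, the column names, and all sample records other than the stock name "Chinese Spring" are hypothetical and are not taken from the PGD design. The sketch only shows how a species column lets the data from each coordinator's database be loaded into one master table, with repeated records ignored.

```python
import sqlite3

# Hypothetical exports from two species coordinators. Field names and
# values are illustrative only and are not real PGD records.
SPECIES_EXPORTS = {
    "wheat": [("Chinese Spring", "stock")],
    "corn": [("example-stock-1", "stock"), ("example-map-1", "map")],
}


def build_master(exports):
    """Load every species export into one generic master table."""
    conn = sqlite3.connect(":memory:")
    # One table for all species: the species is a column, not a table name.
    conn.execute(
        "CREATE TABLE master ("
        " species TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " kind TEXT NOT NULL,"
        " UNIQUE (species, name, kind))"
    )
    for species, records in exports.items():
        # INSERT OR IGNORE skips records that are already present.
        conn.executemany(
            "INSERT OR IGNORE INTO master (species, name, kind) VALUES (?, ?, ?)",
            [(species, name, kind) for name, kind in records],
        )
    conn.commit()
    return conn


def count_by_species(conn):
    """Return a dict mapping each species to its number of records."""
    rows = conn.execute(
        "SELECT species, COUNT(*) FROM master GROUP BY species ORDER BY species"
    )
    return dict(rows.fetchall())
```

Calling count_by_species(build_master(SPECIES_EXPORTS)) on the sample data reports two corn records and one wheat record. Adding a further species under this assumed layout means adding rows, not tables.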
Besides the online system available via Internet, NAL will
also disseminate the database on tape for those wishing to create
their own systems. Lastly, it is expected that there will be a
CD-ROM version of the database for those who prefer that medium.
In addition to the service role, NAL will perform another important function: it will be responsible for assuring that all of the databases coming from the mapping groups are consistent and standardized to the maximum extent possible.
For example, standards are being established or adopted by NAL
for gene nomenclature, literature citations, and terminology.
Prototype Database Development for Corn, Soybean, Wheat,
Arabidopsis, and Forest Tree Species
The assigned objectives of this ARS-managed project are: a)
initiate beta-testing of the prototype generic database on the
genetics of these species and work in a cooperative effort to
establish a generic database for agronomic plants, and b)
engender a "grass roots" support base for the genome database
efforts, i.e., the scientists are participating in the
development of the database in contrast to a "top down" approach.
Funding was provided to locations at Ames, Iowa (soybean),
Columbia, Missouri (corn), Albany, California (wheat, forest
trees), and Harvard (Arabidopsis). At present, data for the
database program come from sources beyond the five laboratories
mentioned above and are sent to the NAL Plant Genome Database
over the Internet.
Database design and related information were agreed to this
year by all participating groups. A demonstration of the system
was given at the Plant Genome I meeting, where both attendance
and acceptance were overwhelming. A larger and more comprehensive
demonstration will be given at Plant Genome II in January 1994
(5).
Priority database topics include Disease/Pathology, Genetic
Resources, Germplasm, Genetic Maps, Metabolic Pathways,
Organelles, Quantitative Traits, and Database Quality Control as
determined by ARS in cooperation with scientists from the
public and private sectors. Groups continue to explore specific
user needs for each of these topics. To facilitate and expand
data collection, assimilation, and database definition for some
of these topics, specific cooperative agreements were arranged
with researchers working on genetic mapping (RFLP) and other
mapping procedures, nitrogen metabolism, oil and fatty acid
biosynthesis, genetic collections, evolution, and software
development. These agreements, together with reports of progress
and problems on the priority topics, have been used to better
address the specific database problem areas assigned.
Additional specific cooperative agreements with various
agricultural research labs around the USA and in foreign
countries are being drafted for database work dealing with
diseases, pathology, and molecular mapping. Agreements between
ARS and the Computing Sciences Division, Lawrence Berkeley
Laboratory, Berkeley, CA, and Yale University for the design and
implementation of the representative species genome database have
also been established. The net result of these cooperative
projects is a carefully planned and widely agreed-upon database
and access system. This should help assure its success in the
years to come, as the system grows and scientists in the
worldwide agricultural community use it as part of their
everyday research.
Centralization of Database Activities at the National
Agricultural Library
NAL has established the Plant Genome Data and Information
Center (PGC) in support of the Plant Genome Program with an
allocation of about $1 million per year from ARS. The major
accomplishments which this funding has provided can be divided
into three areas: the Plant Genome Database, Information
Dissemination, and Bibliographic Materials Enhancement.
Plant Genome Database (PGD)
The initial design of the Plant Genome Database (PGD) is
completed. It already includes phenotypic trait and germplasm
data. Metabolic data remain as a design issue. More raw data
will be placed in the database. The design is generic; no
species-specific design elements exist at present. This will
ensure that the database easily accommodates future expansion to
other species without major design changes. However, data input
and further development are still necessary. The database is now
in beta-test and will be released in 1994. Its current contents
are shown in Table 2.
Table 2. Current contents of the Plant Genome Database

3.3 Megabytes of Data
3.3 Million records
300,000 References
7,200 Sites
108 Maps
4,430 Allele Variations
21,000 Stocks
2,593 Traits
340,000 Phenotypes
Existing databases are also being linked into the Plant Genome
Database: GenBank and PIR are linked already, and GRIN will be.
This integration is essential to guarantee the maximum utility of
the data while minimizing duplication of effort. It centralizes
agricultural information and ties it in with the National
Institutes of Health Human Genome Database and the database of
the EMBL.
Information Dissemination
NAL has established an information center as part of the PGC
that is responsible for disseminating genomic information to the
public. The Center responds to requests for specific types of
information, reviews all available sources, and reports back to
the requestor with the findings. An outreach program has been
developed to bring the Plant Genome Program to the scientific
community at major scientific conferences and research locations.
An example of this is Probe (6), the quarterly newsletter of the
Plant Genome Program. Probe contains updates about the progress
of the program as well as articles from some of the top
researchers in genetics; national and international circulation
for the first issue exceeded 6,000. The Center also develops
technical, subject-oriented publications. These publications
include directories of experts, bibliographies related to
methodologies and software, and listings of sequenced genes. Other
information products will be produced to meet the information
needs of the program and the scientific community.
Bibliographic Materials Enhancement
NAL's AGRICOLA bibliographic database system has been
improved in three fundamental ways: first, the focus of the
system has been altered to expand coverage of genetic information
sources; second, the quality of the bibliographic records has
been enhanced by adding extensive abstracts; and third, new
keywords, which identify specific records containing genomic
data, have been added. All of these enhancements greatly improve
a user's ability to retrieve relevant information. This is
especially true when one considers that these enhancements are
being linked with the other data in the Plant Genome Database
(PGD).
Examples of results from the Plant Genome Database
Providing examples of interactive computer systems for a
journal paper is really an impossible task, so only "snapshots"
of the system will be given. It should be mentioned that this
problem has been overcome in oral presentations in many modern
lecture halls by using a computer system connected to the
projection screen or by using a previously prepared video tape
(in the proper NTSC, PAL, or SECAM format).
Figure 1 shows the main menu, and Figures 2a-c show the results
from a real search (for all stock names beginning with the
letters "ch"). Figures 3a-b show the results from a reference
search. Any of these figures (screen dumps from the computer
screen) can be sent via electronic mail to anyone who has an
e-mail address on the Internet. For explanations of each figure,
please refer to the text beneath the figure or read the figure
captions listed after the references at the end of this paper.
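For readers without access to the system, a search of the kind behind Figures 2a-c (all stock names beginning with "ch") can be approximated with a short sketch. It again uses Python's sqlite3 module as a stand-in for the actual database. The table name stock, the column name, and every sample value except "Chinese Spring" are hypothetical, and the real PGD query interface may work quite differently.

```python
import sqlite3


def find_stocks_by_prefix(conn, prefix):
    """Return stock names that start with `prefix`, ignoring case."""
    # Escape the LIKE wildcards so the prefix is matched literally.
    escaped = (
        prefix.lower()
        .replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT name FROM stock WHERE LOWER(name) LIKE ? ESCAPE '\\' ORDER BY name",
        (escaped + "%",),
    )
    return [name for (name,) in rows]


# Illustrative data; only "Chinese Spring" appears in this paper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO stock (name) VALUES (?)",
    [("Chinese Spring",), ("Example Stock A",), ("chosen example",)],
)
print(find_stocks_by_prefix(conn, "ch"))
```

The prefix is passed as a bound parameter rather than pasted into the SQL text, so user input cannot alter the query, and the wildcard escaping keeps a search for "%" from matching every stock.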
CONCLUSIONS
The Plant Genome Database (PGD) is now a real and
functioning information and data resource for agricultural genome
researchers. As the system increases in size and intellectual
content, its value will greatly increase and enhance the abilities
of researchers to undertake more sophisticated genome research,
which will ultimately benefit the farming community and all
consumers.
REFERENCES
1. a) J. D. Mowder and A. K. Stoner, "Information Systems", Plant Breeding Reviews, 7, pages 57-65 (1989); b) for further details and an information brochure, contact GRIN, the Germplasm Resources Information Network, Database Manager, USDA, ARS, Bldg. 003, Beltsville, MD 20705-2350. Phone: 1-301-504-5666; FAX: 1-301-504-5536.
2. See Science, 22 October 1993, pages 502-505.
3. For further details contact the Plant Genome Information Center, USDA, NAL, 10301 Baltimore Blvd., Beltsville, Maryland 20705-2351, USA. Phone: 1-301-504-6875; FAX: 1-301-504-7098 or e-mail to SMCCARTHY@ASRR.NALUSDA.GOV.
4. Access to the database is available via Internet as follows:
a) telnet to PROBE.NALUSDA.GOV; b) log in as "PGD" with "GENOME"
as the password; c) follow the instructions; d) fill out the
online registration form if you have not done so. If you are
having any problems with access to the system, please send an e-mail message to PGENOME@NALUSDA.GOV.
5. For information on the Plant Genome meeting series please contact: Scherago International, Inc., 11 Penn Plaza, Suite 1003, New York, NY 10001 (Phone: 212-643-1750; FAX: 212-643-1758).
6. Probe, the Newsletter for the USDA Plant Genome Research
Program. Order from USDA, NAL, 10301 Baltimore Blvd.,
Beltsville, Maryland 20705-2351, USA. Phone: 1-301-504-6875;
FAX: 1-301-504-7098 or e-mail to SMCCARTHY@ASRR.NALUSDA.GOV.
FIGURE CAPTIONS
1. A computer screen dump of the Main Menu for PGD (note that Sites are all Mapped objects). This, and all the other figures, are screen dumps from a Sun UNIX workstation.
2a. A computer screen dump showing the results of a search for all stock names beginning with the letters "ch".
2b. The Detail Screen for the stock name "Chinese Spring" (selected from the choices in Figure 2a).
2c. The Detail Screen of a collection (ARS - National Small Grain Collection) where the stock Chinese Spring can be found. This screen was brought up by going to "Stock" from the Main Menu and searching for "chinese Spring". Then the Collection option was selected from the Go TO menu.
3a. A computer screen showing the results of a search for all maps for all species. [This screen was reached by choosing the map option from the Main Menu.]
3b. This computer screen is the detail of a reference record retrieved from the search done in Figure 3a. Note that the entire abstract exists and can be viewed by the user, but that it will not all fit on the computer screen at the same time. However, the user can scroll through the abstract to view all the pages.