Curation of Databases in Molecular Biology
Report
... (draft)
The motivation for this course stems from the crucial importance
of high quality databanks to the entire biomedical community.
These databanks include archival databanks of nucleic acid
sequences, of protein sequences and of biological macromolecular
structures. Other, specialised databanks based on these data
present selected slices of them in different ways. The contents
of the databanks include both primary experimental data, such
as DNA sequences, and annotations, for instance the identification
of genes in DNA sequences.
The experimental techniques for determining the primary data are
generally of quite high quality, and robust estimates of error
are available. Annotation, however, is the weak part of the
enterprise. Much annotation is created automatically by computer
programs, and human checking of the results is very difficult
and labour-intensive. As a result, databank entries combine
very high quality primary data with annotations that are far
less trustworthy. Nevertheless, correct annotation is essential
for applications of databanks to research. Enhancements in
computer-generated annotation will improve the situation, but
there is consensus that in the end human curation will be
necessary.
For these reasons it was felt that a course in database annotation
would be useful, in the following ways:
(1) It would present in a coherent form, to students preparing
for careers in bioinformatics, the nature of the problems
of annotation in the current state of the databanks. Indeed,
a goal of the course was to assemble an appropriate panel
of people -- both faculty and students -- such that a useful
description of the state of the problem would emerge from
the presentations and the discussions.
(2) It would develop suggestions for how the situation could be
ameliorated, and what kinds of training would be required
to prepare scientists to contribute to this enterprise.
(3) It would make materials prepared by the speakers available
on the web site, so that other interested scientists could learn
about annotation, its problems, and potential methods for
improvement. Other courses or presentations could be based
on this material, and bioinformatics students could be directed
to it as study material.
The course was well attended and the presentations and discussions
were lively. (It was unfortunate that the course took place
so soon after the tragic events of 11 September that some
participants dropped out at the last minute.) Nevertheless,
the room was full and the discussion robust. As organiser
I feel that the course was a success, and informal comments
from other participants confirm this impression.
The presentations treated several fundamental themes. One of these
was the analysis of genomic sequences. The annotation of genomes
(identifying genes and trying to infer structures and functions
of the corresponding proteins) was the subject of the talks
by J. Parkhill and G. Mannhaupt. Annotation of eukaryotic genomes
is more difficult than annotation of prokaryotic genomes, because
of the need to infer correctly-spliced genes from the raw DNA
sequence. In semi-automatic annotation, software presents the
results of several prediction programs to an expert curator, who
chooses an annotation from them interactively. Programs increase
in accuracy if they are trained on organism-specific learning sets.
Integration of as much experimental information as possible
improves the quality of annotation.
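The semi-automatic workflow described above can be illustrated with a
minimal sketch. It assumes that each prediction program reports a gene
model as start, end and strand, and that candidates are ranked simply by
how many programs agree. The program names and the ranking rule are
hypothetical and are not taken from the talks.

```python
from collections import defaultdict
from typing import NamedTuple


class Prediction(NamedTuple):
    program: str  # hypothetical gene-finder name
    start: int
    end: int
    strand: str


def rank_candidates(predictions):
    """Group identical gene models and rank them by supporting programs."""
    support = defaultdict(set)
    for p in predictions:
        support[(p.start, p.end, p.strand)].add(p.program)
    # Most-supported models first; ties are broken by start coordinate.
    return sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0][0]))


predictions = [
    Prediction("finder_a", 100, 400, "+"),
    Prediction("finder_b", 100, 400, "+"),
    Prediction("finder_c", 120, 400, "+"),
]

# The ranked list is what a curator would be shown; the final choice
# remains a human decision.
ranked = rank_candidates(predictions)
for (start, end, strand), programs in ranked:
    print(f"{start}-{end} {strand}: supported by {sorted(programs)}")
```

Agreement among programs is only one possible ordering; experimental
evidence, where available, would normally take precedence.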
C. Wu treated protein sequence databases, and described
some of the problems associated with the uncritical transfer
of annotation among entries in the same or different databases.
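One way to make annotation transfer less uncritical is to record where
each annotation came from and to refuse to copy an annotation that was
itself only inferred. The sketch below illustrates that idea. The
evidence labels, the function name and the identity threshold of 0.4 are
illustrative assumptions, not a method described in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    function: str
    evidence: str                 # "experimental" or "inferred" (assumed labels)
    source: Optional[str] = None  # entry the annotation was copied from


def transfer_annotation(source_id, source_ann, identity, min_identity=0.4):
    """Return an inferred annotation, or None if the transfer looks unsafe.

    The default threshold is an arbitrary illustrative value.
    """
    if source_ann.evidence != "experimental":
        return None  # never chain one inference onto another
    if identity < min_identity:
        return None  # sequences too dissimilar to justify the transfer
    return Annotation(source_ann.function, "inferred", source_id)


experimental = Annotation("kinase", "experimental")
copied = transfer_annotation("P1", experimental, identity=0.8)
chained = transfer_annotation("P2", copied, identity=0.9)
print(copied)
print(chained)
```

Keeping the source entry with each inferred annotation lets a curator
trace it back to the original evidence.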
S. Velankar treated databases of macromolecular structures,
the format of their entries and their annotation, and the linking
of macromolecular structural databases with other databases
in molecular biology. A comprehensive database called integr8,
under development at the European Bioinformatics Institute,
will encompass information on (1) genetic regulatory
elements, (2) gene expression, (3) protein expression, (4)
gene product information, (5) protein families and domains,
(6) structural and ligand information, (7) molecular function,
(8) biological role, (9) location of gene products.
M. Krichevsky gave an introduction to taxonomy, an essential
but treacherous component of correct annotation; the subject
is one of which many molecular biologists know far less than
they should.
G.J.L. Kemp and T. Etzold discussed general ideas of the
architecture of databases for molecular biology.
Kemp discussed models of database interactivity. The problem
arises because of the need to address queries to a large,
disparate and heterogeneous set of databases, which may be
individually constructed with different database architectures,
including simple 'flat files', relational databases and
object-based systems.
He distinguished between tight and loose coupling among databases.
Etzold described networks of databases in molecular biology
and attempts to integrate them, including hypertext links
and indexed links.
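As a rough illustration of querying heterogeneous sources through one
interface, the sketch below joins a toy flat file with a small
relational table on a shared identifier. Everything in it (the
two-letter tags, the table layout, the identifiers and the function
names) is a hypothetical stand-in and does not describe any system
presented at the course.

```python
import sqlite3

# Toy flat file: two-letter tag, three spaces, value; "//" ends an entry.
FLAT_FILE = """\
ID   P1
DE   hypothetical kinase
//
ID   P2
DE   hypothetical transporter
//
"""


def parse_flat_file(text):
    """Parse the toy flat file into {identifier: description}."""
    entries, current = {}, {}
    for line in text.splitlines():
        if line.startswith("//"):
            entries[current["ID"]] = current.get("DE", "")
            current = {}
        elif line.strip():
            current[line[:2]] = line[5:].strip()
    return entries


def build_structure_db():
    """Create an in-memory relational table linking proteins to structures."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE structure (protein_id TEXT, structure_id TEXT)")
    conn.executemany(
        "INSERT INTO structure VALUES (?, ?)",
        [("P1", "S100"), ("P1", "S101")],
    )
    return conn


def lookup(protein_id, sequences, conn):
    """Answer one query from both sources, leaving each in its own format."""
    rows = conn.execute(
        "SELECT structure_id FROM structure "
        "WHERE protein_id = ? ORDER BY structure_id",
        (protein_id,),
    ).fetchall()
    return {
        "id": protein_id,
        "description": sequences.get(protein_id),
        "structures": [row[0] for row in rows],
    }


sequences = parse_flat_file(FLAT_FILE)
conn = build_structure_db()
print(lookup("P1", sequences, conn))
```

Neither source is converted to the other's format here; only the shared
identifier connects them, which is the kind of cross-reference that
links between databases depend on.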
In addition to the invited speakers, three of the students presented
databases that they themselves had helped to construct.
G.H. Thomas presented Echobase, a database of the organism
E. coli including the genome, the assignment of genes and the
functions of the corresponding proteins. A. Zanzoni presented
MINT, a molecular interaction database. G. Barker presented
westDB, a wheat ES database for the cereals IGF programme.
The database contains functional assignments of genes in cereals.
In conclusion, I feel that this course has made a contribution
to the appreciation of an important problem in contemporary
bioinformatics and to the initiation of efforts to solve it.
Arthur Lesk (Cambridge)
Presentations from the speakers...
List of participants ...
| speakers: | | |
| Arthur Lesk | University of Cambridge, UK | aml2@mrc-lmb.cam.ac.uk |
| Thure Etzold | LION Bioscience Ltd., Cambridge, UK | Thure.Etzold@uk.lionbioscience.com |
| Graham Kemp | University of Aberdeen, UK | gjlk@csd.abdn.ac.uk |
| Micah Krichevsky | Bionomics International, Rockville, USA | micahk@eudoramail.com |
| Gertrude Mannhaupt | MIPS, Muenchen, Germany | G.Mannhaupt@gsf.de |
| Julian Parkhill | Sanger Centre, Hinxton, UK | parkhill@sanger.ac.uk |
| Cathy Wu | National Biomedical Research Foundation, USA | wuc@nbrf.georgetown.edu |
| Sameer Velankar | European Bioinformatics Institute, Hinxton, UK | sameer@ebi.ac.uk |
| participants: | | |
| Ramon Alonso-Allende | Spain | allende@cnb.uam.es |
| Gary Barker | UK | gary-la.barker@bbsrc.ac.uk |
| Anton Beyer | | BEYER@imp.univie.ac.at |
| Vincent Collura | France | vcollura@hybrigenics.fr |
| Pasqualina D'Ursi | Italy | dursi@inb.mi.cnr.it |
| Jean Garnier | France | jgarnier@jouy.inra.fr |
| Jean-Francois Gibrat | France | gibrat@versailles.inra.fr |
| Mark Hoebeke | France | hoebeke@versailles.inra.fr |
| Katerina Hodanova | Czech Republic | khod@lf1.cuni.cz |
| Renate Kania | Germany | Renate.Kania@eml.villa-bosch.de |
| Annette Martin | UK | annette.martin@bbsrc.ac.uk |
| Luisa Montecchi-Palazzi | Italy | luisa@obelix.bio.uniroma2.it |
| Kristoffer Rapacki | Denmark | rapacki@tryptophan.cbs.dtu.dk |
| Wolfgang Schreiner | Germany | c/o Claudia.Holzer@akh-wien.ac.at |
| Francesca Scotto Lavina | Italy | francesca.scotto@unina2.it |
| Blanka Stiburkova | Czech Republic | bstib@lf1.cuni.cz |
| Gavin Thomas | UK | ght2@york.ac.uk |
| Christina von Gertten | Sweden | christina.vongertten@ks.se |
| Andrew Warry | UK | andrew.warry@bbsrc.ac.uk |
| Andreas Zanzoni | Italy | andreas@obelix.bio.uniroma2.it |