Curation of Databases in Molecular Biology


Report ... (draft)

The motivation for this course stems from the crucial importance to the entire biomedical community of high quality databanks. These databanks include archival databanks of nucleic acid sequences, of protein sequences and of biological macromolecular structures. Other, specialised, databanks based on these data present selected slices of them in different ways. The contents of the databanks include both primary experimental data, such as DNA sequences, and annotations, for instance identification of genes in DNA sequences.

The experimental techniques for determining the primary data are generally of quite high quality, and robust estimates of error are available. Annotation, however, is the weak part of the enterprise. Much annotation is created automatically by computer programs, and human checking of the results is very difficult and labour-intensive. As a result, databank entries contain very high quality primary data alongside annotations that are far less trustworthy. Nevertheless, correct annotation is essential for applications of databanks to research. Improvements in computer-generated annotation will help, but there is consensus that in the end human curation will be necessary.

For these reasons it was felt that a course in database annotation would be useful, in the following ways:

(1) It would present in a coherent form, to students preparing for careers in bioinformatics, the nature of the problems of annotation in the current state of the databanks. Indeed, a goal of the course was the assembly of an appropriate panel of people -- both faculty and students -- such that a useful description of the state of the problem would emerge from the presentations and the discussions.

(2) It would develop suggestions for how the situation could be ameliorated, and what kinds of training would be required to prepare scientists to contribute to this enterprise.

(3) It would make materials, originated by the speakers, available on the web site, so that other interested scientists can learn about annotation, its problems, and potential methods for improvement. Other courses or presentations could be based on this material, and bioinformatics students could be directed to it as study material.

The course was well attended and the presentations and discussions were lively. (It was unfortunate that the course took place so soon after the tragic events of 11 September that some participants dropped out at the last minute.) Nevertheless, the room was full and the discussion robust. As organiser I feel that the course was a success, and informal comments from other participants confirm this impression.

The presentations treated several fundamental themes. One of these was the analysis of genomic sequences. The annotation of genomes (identifying genes and trying to infer structures and functions of the corresponding proteins) was the subject of the talks by J. Parkhill and G. Mannhaupt. Annotation of eukaryotic genomes is more difficult than annotation of prokaryotic genomes, because of the need to infer correctly spliced genes from the raw DNA sequence. In semi-automatic annotation, a program presents a curator with the output of prediction programs, from which the expert interactively chooses an annotation. Programs increase in accuracy if they are trained on organism-specific learning sets. Integrating as much experimental information as possible improves the quality of annotation.

C. Wu treated protein sequence databases, and described some of the problems associated with the uncritical transfer of annotation among entries in the same or different databases.
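The propagation problem can be illustrated with a small sketch. It is not taken from the talk: the `Entry` record, its `source` field and the accessions `P1` to `P4` are hypothetical, and the sketch assumes that each entry records which other entry its annotation was copied from. Under that assumption, a single wrong annotation can be traced to every entry that inherited it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    """Hypothetical databank entry with a recorded annotation source."""
    accession: str
    function: str
    # Accession the annotation was copied from; None means the
    # annotation is assumed to rest on direct experimental evidence.
    source: Optional[str] = None


def provenance_chain(entries: Dict[str, Entry], accession: str) -> List[str]:
    """Follow 'copied from' links back to the original annotation."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = accession
    # The 'seen' set guards against circular transfers (A from B, B from A).
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = entries[current].source
    return chain


def affected_by(entries: Dict[str, Entry], bad_accession: str) -> List[str]:
    """Accessions whose annotation ultimately derives from bad_accession."""
    return sorted(
        acc for acc in entries
        if acc != bad_accession
        and bad_accession in provenance_chain(entries, acc)
    )


# Toy data: P2 copied its annotation from P1, and P3 copied from P2.
entries = {
    "P1": Entry("P1", "kinase"),
    "P2": Entry("P2", "kinase", source="P1"),
    "P3": Entry("P3", "kinase", source="P2"),
    "P4": Entry("P4", "protease"),
}

print(provenance_chain(entries, "P3"))
print(affected_by(entries, "P1"))
```

If the annotation of `P1` turns out to be wrong, `affected_by` lists the entries that would need re-checking. Without a recorded source, that list cannot be reconstructed, which is one reason uncritical transfer is hard to undo.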

S. Velankar treated databases of macromolecular structures, the format of their entries and their annotation, and the linking of macromolecular structural databases with other databases in molecular biology. A comprehensive database called integr8, under development at the European Bioinformatics Institute, will encompass information on (1) genetic regulatory elements, (2) gene expression, (3) protein expression, (4) gene product information, (5) protein families and domains, (6) structural and ligand information, (7) molecular function, (8) biological role, (9) location of gene products.

M. Krichevsky gave an introduction to taxonomy, an essential but treacherous component of correct annotation; the subject is one of which many molecular biologists know far less than they should.

G.J.L. Kemp and T. Etzold discussed general ideas of architecture of databases for molecular biology. Kemp discussed models of database interactivity. The problem arises from the need to address queries to a large and heterogeneous set of databases, which may be individually constructed with different architectures, including simple 'flat files', relational databases and object-based databases. He distinguished between tight and loose coupling among databases. Etzold described networks of databases in molecular biology and attempts to integrate them, including hypertext links and indexed links.
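As a rough illustration of querying across different architectures, the sketch below joins a flat file with a relational table through a shared identifier. It is one possible reading of a loosely coupled arrangement, in which each source keeps its own format and a thin layer combines the answers; it is not a description of the systems presented by Kemp or Etzold. The two-letter line codes, the table layout, the identifiers and the function names are all made up for the example.

```python
import sqlite3

# Hypothetical flat-file source: one record per block, ended by "//".
FLAT_FILE = """ID   X001
DE   hypothetical protein
//
ID   X002
DE   DNA polymerase
//
"""


def parse_flat_file(text):
    """Parse the toy flat file into {identifier: {line_code: value}}."""
    records = {}
    current = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if "ID" in current:
                records[current["ID"]] = current
            current = {}
        elif line.strip():
            # The line code is everything before the first space.
            code, _, value = line.partition(" ")
            current[code] = value.strip()
    return records


def make_relational_db():
    """Hypothetical relational source linking sequences to structures."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE structure (seq_id TEXT, structure_id TEXT)")
    conn.executemany(
        "INSERT INTO structure VALUES (?, ?)", [("X002", "S42")]
    )
    return conn


def lookup(seq_id, flat_records, conn):
    """Answer one query by consulting both sources independently."""
    record = flat_records.get(seq_id)
    if record is None:
        return None
    rows = conn.execute(
        "SELECT structure_id FROM structure "
        "WHERE seq_id = ? ORDER BY structure_id",
        (seq_id,),
    ).fetchall()
    return {
        "id": seq_id,
        "description": record.get("DE", ""),
        "structures": [row[0] for row in rows],
    }


records = parse_flat_file(FLAT_FILE)
conn = make_relational_db()
print(lookup("X002", records, conn))
```

Neither source knows about the other; only `lookup` depends on both. The price of that independence is that the shared identifier must be kept consistent by hand, which is itself a curation task.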

In addition to the invited speakers, three of the students presented databases that they themselves had helped to construct. G.H. Thomas presented Echobase, a database of the organism E. coli including the genome, the assignment of genes and the functions of the corresponding proteins. A. Zanzoni presented MINT, a molecular interaction database. G. Barker presented westDB, a wheat EST database for the cereals IGF program. The database contains functional assignments of genes in cereals.

In conclusion, I feel that this course has made a contribution to the appreciation of an important problem in contemporary bioinformatics and to the initiation of efforts to solve it.

Arthur Lesk (Cambridge)

Presentations from the speakers...

Thure Etzold Structuring molecular biology databanks into a databank network
Graham Kemp Models of database interactivity
Micah Krichevsky Taxonomy and biological nomenclature
Gertrude Mannhaupt Design and sources of annotations
Julian Parkhill Distinction between measurements and inferences; Curation of data and annotation; Gene prediction
Cathy Wu Error propagation among databases; Current problems of distribution formats; How to modernise data models and not destroy the investment of the user community in software
Sameer Velankar Structure Databases; Annotation of Structure; Formats of Structural Data
Gary Barker Wheat ESTs
Gavin Thomas Echobase: keeping pace with E.coli in the post-genomic era
Andreas Zanzoni MINT database on protein-protein interactions

List of participants ...

speakers:    
Arthur Lesk University of Cambridge, UK aml2@mrc-lmb.cam.ac.uk
Thure Etzold LION Bioscience Ltd., Cambridge, UK Thure.Etzold@uk.lionbioscience.com
Graham Kemp University of Aberdeen, UK gjlk@csd.abdn.ac.uk
Micah Krichevsky Bionomics International, Rockville, USA micahk@eudoramail.com
Gertrude Mannhaupt MIPS, Muenchen, Germany G.Mannhaupt@gsf.de
Julian Parkhill Sanger Centre, Hinxton, UK parkhill@sanger.ac.uk
Cathy Wu National Biomedical Research Foundation, USA wuc@nbrf.georgetown.edu
Sameer Velankar European Bioinformatics Institute, Hinxton, UK sameer@ebi.ac.uk
participants:    
Ramon Alonso-Allende Spain allende@cnb.uam.es
Gary Barker UK gary-la.barker@bbsrc.ac.uk
Anton Beyer BEYER@imp.univie.ac.at
Vincent Collura France vcollura@hybrigenics.fr
Pasqualina D'Ursi Italy dursi@inb.mi.cnr.it
Jean Garnier France jgarnier@jouy.inra.fr
Jean-Francois Gibrat France gibrat@versailles.inra.fr
Mark Hoebeke France hoebeke@versailles.inra.fr
Katerina Hodanova Czech Republic khod@lf1.cuni.cz
Renate Kania Germany Renate.Kania@eml.villa-bosch.de
Annette Martin UK annette.martin@bbsrc.ac.uk
Luisa Montecchi-Palazzi Italy luisa@obelix.bio.uniroma2.it
Kristoffer Rapacki Denmark rapacki@tryptophan.cbs.dtu.dk
Wolfgang Schreiner Germany c/o Claudia.Holzer@akh-wien.ac.at
Francesca Scotto Lavina Italy francesca.scotto@unina2.it
Blanka Stiburkova Czech Republic bstib@lf1.cuni.cz
Gavin Thomas UK ght2@york.ac.uk
Christina von Gertten Sweden christina.vongertten@ks.se
Andrew Warry UK andrew.warry@bbsrc.ac.uk
Andreas Zanzoni Italy andreas@obelix.bio.uniroma2.it

The organisers of the meeting would like to thank CODATA for their assistance and support.
http://www.codata.org