Data
integration in functional genomics and proteomics
Report
... (draft)
Introduction
Beyond
an analysis of gene function at the genome level, the ultimate
goal of functional genomics is to understand the organisation
and coordinated operation of the cell. Integration of data
and information is an essential feature of all the steps leading
from the production of experimental results to the modelling
of a complete cell. The Geneva workshop, organised within
the framework of the European Science Foundation (ESF) Programme
on Integrated Approaches for Functional Genomics, provided
an excellent vantage point from which to review the issue
of data integration from various perspectives. It brought
together scientists with different backgrounds (biologists,
bioinformaticians), who currently participate in consortia
involving or requiring the integration of heterogeneous biological
data. The workshop dedicated equal time to presentations and
discussions in order to optimise knowledge distribution and
sharing.
The first session was devoted to existing functional genomics
projects, where a diverse set of experimental approaches is
applied to a model organism or given project. For these, both
the scientific objectives and data management strategies were
described. Another set of presentations focused on 'data integration
requirements' in proteomics. In this rapidly evolving area,
data integration is a key issue for technological developments
that are being carried out in both academic and industrial
contexts. Data integration was then presented by scientists
involved in the design and maintenance of public database
resources. The last session was devoted to bioinformatics
and gave an overview of state-of-the-art approaches for building
information systems that support data integration
well. The workshop concluded with an open discussion
on bottlenecks and perspectives that emerged from the presentations.
Session
on Bottlenecks and Perspectives
The
last session had the format of an open discussion based on
a number of predefined topics.
Possibility
of sharing data and database schemas. There is an
obvious need to compare and perhaps exchange database
schemas between ongoing projects. As presented here,
most of the projects involve the creation of a database,
and all developers had encountered the difficulty of
heterogeneous formats and data descriptions in existing
databases. The discussion highlighted the practical
difficulties of sharing database schemas between different
networks, which stem from several causes. Some databases
are not at a stage suitable for sharing information
or are not stable yet. In addition, there are intellectual
property issues and restrictions from contracted partners
(commercial or otherwise). The discussion then raised
the issue that the EU has only recently started to fund
bioinformatics activity in most of these projects. It
was suggested that public funding ought to be related
to public access to the information in the databases
and their schema. This is particularly important as
numerous publications are based on data that are not
publicly available. Therefore post-project maintenance
of the databases is also required, which needs additional
funding. Even if funding of sequence databases, such
as SWISS-PROT or EMBL, is more or less stable today,
this is not the case for the newer types of functional
genomics related databases. There is currently no source
of EU funding for curating these highly important data.
At the end of the granted period, there is
no rule and no funding assigned to the maintenance of
the data collected, which poses a considerable threat
that all of these valuable data may be lost. This topic
has to be discussed further and practical solutions
reached.
There was consensus among the projects that have
been running for more than one year that interaction
between biologists and bioinformaticians has to be improved.
The current lack of interaction relates to the difficulty
bioinformaticians face in predicting the needs of biologists
and evaluating the future possibilities of this rapidly
evolving technology. The current tactic uses a pragmatic
approach of matching a 'prototype' model. This then
requires specific redesigning and optimisation. Only
recently could the MGET and EU PLANET projects define
a workpackage specific to dealing with this issue. Clearly,
biologists need to understand the capabilities of bioinformatics
to extract and interpret the masses of data that are
generated by their research to assist them further.
Status
of modelling and systems biology. A new and far-reaching
area, this topic was generally not represented in the
various projects here. The issue pertains to the collection
of large amounts of data from extraneous sources, which
is not always possible within these projects. In addition,
the lack of common crossover between databases is currently
a limiting factor. A proposal has been submitted for
an ESF programme workshop in 2002, focusing on this
issue. The same arguments are true when initiating data
mining activities; here, a workshop may be scheduled
for 2003.
Conclusions
and Future
The
workshop, as an exploratory experience, was considered very
interesting and successful by a vast majority of the participants.
The broad range of individual expertise raised interesting
issues and highlighted the need for multidisciplinary crossover
between the partners of EU functional genomics projects. This
workshop was aimed at assembling combined knowledge about
the current status of these European projects, which was a
unique opportunity. In addition to the comments made above,
some further proposals have been made for future activities
and improvements to the format of workshops such as this.
Now that the initial status of the various projects has been
assessed, further, shorter workshops focusing on more specific
topics should be organised, perhaps as satellites of conferences,
in order to maintain the momentum of sharing and integration
begun here. The current ESF programme may finance several
training courses or personnel exchanges, within the programme
areas, in order to facilitate crossover and integration of
bioinformatics and biology. Efforts are still being made to
improve the content (quality and format) of
databases and to provide end users with powerful, transparent
and optimised tools. This may be made possible by encouraging
better communication between established and future projects.
Note:
REALIS, REGIA,
BACELL, EBP
are funded under the Fifth Framework Programme by the Quality
of Life and Management of Living Resources Programme (http://europa.eu.int/comm/research/quality-of-life.html)
of the European Commission. Description of the projects can
be found on the Community R&D Information Service (CORDIS)
server (http://www.cordis.lu/).
Abstracts
...
Session
on Functional Genomics Projects: scientific issues and status
Uwe
Kaerst (Gesellschaft für Biotechnologische Forschung,
Braunschweig, Germany) introduced the European Union funded
REALIS project (Molecular
strategies for adaptation and survival: Post-genomic analysis
of the lifestyles of Listeria monocytogenes (LM) in
the environment and the infected host) which commenced in
February 2000. As the LM genome is already known, the project
aims to completely decipher all the genes required for survival
and adaptation of LM within an infected host and for responses
to the external environment. Using genomic and post-genomic
tools, REALIS also seeks to precisely address questions regarding
the evolutionary relationships between pathogenic and non-pathogenic
Listeria and to define the qualities of particularly successful
clonal pathovariants which cause disease. Workpackages are
focused on transcriptomics, proteomics, large-scale generation
of mutants, central regulon analysis, comparative genomics
and bioinformatics. The latter workpackage aims to develop
an integrated bioinformatics database, based on the SRS6 system.
Peter
Rice (LION Bioscience, UK) described
RIBDB, an integrated database of Listeria
experimental data. Based on SRS6, it is hosted on a central
website and supports internal queries by the REALIS consortium.
It is currently populated with public database sequences
of Listeria, gene lists, annotations for L. monocytogenes,
expression data and proteomic spot annotations.
Philippe
Glaser (Institut Pasteur, Paris, France) presented the
comparative genomics activity that will be implemented as
part of the REALIS project.
He depicted phylogenetic relationships between various Listeria
spp., Bacillus subtilis and Mycobacterium genitalium.
Symmetry, similarities and structure conservation were discussed,
based on analysis of genomic sequences.
Javier
Paz Ares presented the aims and structures of the
REGIA (Regulatory Gene Initiative in Arabidopsis)
project, which started in February 2000 with two years of
funding. It is mainly focused on the characterisation of transcription
factors (TF) and their relevance to plant breeding programs.
The seven workpackages include expression patterns, identification
of mutants and TF loci, ectopic expression, phenotype analysis
using high throughput metabolic profiling, interaction using
two-hybrid screening (Y2H) and bioinformatics. Proteomics
is not included in this project. About 1500 TFs are recognised,
of which about 420 zinc finger proteins have so far been
predicted.
The
information storage, analysis and web representation of REGIA
was presented by Alfonso Valencia (Centro Nacional
de Biotecnologia, Madrid, Spain). About 30 groups interact
with the database. The architecture uses XML to transfer the
data to the central relational database. No raw data are introduced.
While the metabolic data and Y2H data are formalised and normalised,
there is no such standard for transcriptional and mutant analyses.
There is a fundamental difficulty in homogenising the experimental
and analytical procedures, when various groups do not employ
the same approaches. This is yet to be completely resolved.
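The XML-to-relational transfer described above can be illustrated with a minimal sketch. It is not the REGIA implementation: the element names, attribute names, locus identifiers and table layout below are hypothetical, chosen only to show how a structured XML submission from one group might be parsed and loaded into a central relational table.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical submission from one partner group. Element names, attribute
# names and locus identifiers are placeholders, not the REGIA schema.
SUBMISSION = """
<submission group="lab-07">
  <interaction method="Y2H">
    <bait locus="locusA"/>
    <prey locus="locusB"/>
    <score>0.82</score>
  </interaction>
  <interaction method="Y2H">
    <bait locus="locusA"/>
    <prey locus="locusC"/>
    <score>0.47</score>
  </interaction>
</submission>
"""


def load_submission(xml_text, conn):
    """Parse one XML submission and insert its interactions into the database."""
    root = ET.fromstring(xml_text.strip())
    group = root.get("group")
    rows = [
        (
            group,
            item.get("method"),
            item.find("bait").get("locus"),
            item.find("prey").get("locus"),
            float(item.findtext("score")),
        )
        for item in root.iter("interaction")
    ]
    conn.executemany(
        "INSERT INTO interaction (source_group, method, bait, prey, score) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


# An in-memory database stands in for the central relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interaction "
    "(source_group TEXT, method TEXT, bait TEXT, prey TEXT, score REAL)"
)
inserted = load_submission(SUBMISSION, conn)
```

Only processed values are loaded here, in line with the statement that no raw data are introduced. Data types without an agreed standard would need a format of their own before they could be transferred in this way.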
Wilhelm
Stiekema (Wageningen University, The Netherlands) presented
the goals of the recently funded
EU PLANET (European Plant Genome Database Network).
With 8 partner groups, the workpackages include topics such
as ontology, high-density maps, data mining tools, haplomaps,
diversity maps, metabolome profiling, genomics, and proteomics.
The study involves analysis of gene function for plant protection.
Much of the available data, particularly that coming from Germplasm
DB (6 million accessions, molecular genetic fingerprints from
which there are more than 250 markers), will be combined with
data generated by the consortium to build an annotation pipeline.
This will probably be based on flat files and automated using
Perl scripts (Cyrille). Wilhelm also mentioned EXOTIC
(the Exon Trapping Insert Consortium), the aim of which is
to detect the spatial and temporal expression patterns of
approximately 5,000 Arabidopsis genes, and to identify and
characterise their regulatory elements.
Colin
Harwood (Newcastle University, UK) presented the activities
of BACELL (From gene regulation
to gene function: regulatory networks in the model Gram-positive
bacterium Bacillus subtilis). The 11 groups forming
this EU funded consortium divided their activities into four
workpackages. Transcriptomic and proteomic approaches were
used to observe responses to environmental changes (stimulons
and regulons) and to characterise regulatory proteins. Global
regulatory networks will be built using data integration and
existing databases, such as MICADO
and SubDB.
Peter
Jungblut (Max-Planck-Institute for Infection Biology,
Berlin, Germany) discussed the EBP
thematic network (Comparative analysis of proteome modulation
in human pathogenic bacteria for the identification of new
vaccines, diagnostics and antibacterial drug targets). It
comprises 10 workpackages, including studies on 6 pathogenic
bacteria. Besides its functional goals, the network aims to
build a database system, mainly incorporating proteomic and
also transcriptomic data. Two complementary alternative approaches
have been proposed to the consortium members: a centralised
database in Berlin and a system based on the federated model
of World-2DPAGE.
Clive
Edwards (University of Liverpool, UK) introduced a field
not frequently considered in standard approaches to organism
study, namely molecular ecology. He highlighted the importance
of monitoring active prokaryotes in their natural environments,
as these organisms are involved in key biogeochemical cycles
and may have pathogenic effects. In the last 10 to 15 years,
molecular biology methods and particularly sequencing of 16S
rRNA have been used to study "functional genes".
The discovery of such genes does not necessarily reflect actual
activity, however. DNA array technology provides a possible
alternative. One major technical difficulty is that the vast
majority of marine prokaryotes cannot yet be cultured.
Proteomic approaches are proposed in order to identify active
prokaryotes and environmentally induced active protein synthesis,
using two-dimensional polyacrylamide gel electrophoresis (2-D
PAGE) coupled with pulse-labeling experiments and mass spectrometry
(MS). The cellular origin of these proteins could be observed
via fluorescently labelled probes.
Session
on Proteomics
Manfredo
Quadroni (University of Lausanne, Switzerland) addressed
the proteomics issues of goals, technologies and quality requirements.
While defining the first phase of proteomics analysis as the
production of experimental data, he discussed the difficulty
of capturing and comparing 2-D PAGE and MS results given the
degree of protocol heterogeneity between different
labs. The second obstacle concerns data handling and
analysis. He pointed out some requirements related to data
management, database usefulness and data quality assessment.
Here there is a need for detailed descriptions of sample origin
and preparation, relationship to expression levels, subcellular
localisation, 3-D structure, biological activity, description
of post-translational modifications and interpretation of
protein interaction maps, among others. He also discussed
the quality difference between curated and archived databases.
Ron
Appel (Swiss Institute of Bioinformatics, Geneva, Switzerland)
focused on quality issues applied to 2-D PAGE image analysis
and related databases. At the user level, efforts have to
be made to document all parameters governing the generation
of 2-D PAGE images. These include sample choice and preparation,
the protocols for the 2-D PAGE separation, the gel staining
and the digitising steps. The image analysis software should,
on its side, deliver information on the type of algorithm
used for spot detection and matching, together with associated
parameters and a measure of confidence. The principle "garbage
in / garbage out" can therefore be understood as "quality
in / quality out". He presented an application of, and
the related challenges for, the Melanie image analysis software
(http://www.expasy.org/melanie),
and the advantages and issues of the Federated Model of 2-D
PAGE databases.
Joel
Vandekerkhove (University of Ghent, Belgium) presented
alternatives to 2-D PAGE based protein identification methodologies.
Besides multidimensional protein identification technology
(MudPIT) and the ICAT technique, he introduced
a new methodology involving specific amino acid labeling and
diagonal chromatography. He illustrated the advantages of
this technique, compared to 2-D PAGE based identification,
in enhancing sensitivity.
Denis
Hochstrasser (Geneva University, Geneva University Hospital,
GeneProt, Switzerland) introduced the principle and the capacities
of the molecular scanner, a parallel, high-throughput method
of analysing 2-D PAGE separated protein samples. He then presented
the industrial approach that GeneProt (http://www.geneprot.com)
has chosen to analyse entire proteomes with high sensitivity.
Session
on Databases
Amos
Bairoch (Swiss Institute of Bioinformatics, Geneva, Switzerland)
presented the current status of the SWISS-PROT
knowledge base (http://www.expasy.org/sprot)
and related databases. As the integrated data comes from various
types of sources, he pointed out that the maintenance of the
information and of the cross-references to external databases
is difficult. In the case of cross-references to Medline/PubMed,
current stability is good. However, it is difficult to link
directly to an article as the journals use different systems.
SWISS-PROT has cross-links to 46 databases, and links to a further
21 are created on the fly. There are links to 38 2-D PAGE databases on the
web that provide annotation for about 200 images. However,
there are currently no public MS databases available on the
web. The DR (cross-reference) lines are in general difficult
to maintain, as various databases have unstable unique identifiers,
which necessitates manual curation. The DATABASE topic of the CC
(comment) lines is used
to link information resources relevant to only a small number
of proteins. These types of annotation are particularly labour
intensive. A new feature is the OX line, which links to the
NCBI taxonomy. As taxonomy frequently changes, these lines
have to be updated frequently, almost every week. In order
to help annotation, controlled vocabularies are used for several
topics, including keywords, journal abbreviations, tissue,
plasmid name, catalytic activity, etc. At the sequence level,
up to 12,000 variants have been validated and annotated. There
are also links to other information, including synonyms, MS
data, cofactors, and pathways. In the future, SWISS-PROT will
also be distributed in XML.
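To make the maintenance problem more concrete, the following minimal sketch extracts the taxonomy identifier from an OX line and the database cross-references from DR lines of a flat-file entry. The entry shown is an invented placeholder (identifiers and accession numbers are made up), and the parser handles only the simple line layout used in this example rather than the full flat-file format.

```python
# Invented placeholder entry: accession numbers and identifiers are made up.
ENTRY = """\
ID   EXAMPLE_BACSU   STANDARD;      PRT;   120 AA.
AC   X00001;
OX   NCBI_TaxID=1423;
DR   EMBL; Z00001; CAA00001.1; -.
DR   SUBTILIST; BG00001; exaA.
//
"""


def parse_cross_references(entry_text):
    """Return (taxon_ids, xrefs) found in the OX and DR lines of one entry.

    xrefs is a list of (database, primary_identifier) tuples.
    """
    taxon_ids = []
    xrefs = []
    for line in entry_text.splitlines():
        # The two-letter line code is separated from the content by three spaces.
        code, _, content = line.partition("   ")
        content = content.strip()
        if code == "OX":
            for field in content.rstrip(";").split(";"):
                key, _, value = field.strip().partition("=")
                if key == "NCBI_TaxID":
                    taxon_ids.append(int(value))
        elif code == "DR":
            fields = [f.strip() for f in content.rstrip(".").split(";")]
            xrefs.append((fields[0], fields[1]))
    return taxon_ids, xrefs


taxon_ids, xrefs = parse_cross_references(ENTRY)
```

Each tuple in `xrefs` depends on an identifier issued by another database. If that database renumbers its entries, the link breaks without any visible change in the entry itself, which is why unstable identifiers lead to manual curation.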
Henning
Hermjakob (European Bioinformatics Institute, Hinxton,
UK) introduced present and future projects of the European
Bioinformatics Institute (EBI, http://www.ebi.ac.uk).
TEMBLOR, accepted in May 2001 for three years, is a 25-member
consortium. It will provide a highly integrated view of genomic
and proteomic data (Integr8)
by drawing on databases maintained at major bioinformatics
centres in Europe, and by creating new and important resources
for protein-protein interactions (IntAct),
structural (EMSD) and
microarray (DESPRAD) data.
Ivan
Moszer (Institut Pasteur, Paris, France) presented the
Subtilist DB, dedicated to Bacillus subtilis. Historically
based on MICADO and Sub2D
databases, the main features of Subtilist
include enhanced and verified information on contigs, genes
and proteins, EMBL entries and bibliographic references. In addition
to sequence corrections, applied to more than 200 genes,
a significant number of "unknown" genes could be
labelled and associated with functions. Subtilist is currently
being adapted to handle transcriptomics data that are produced
in the framework of the BACELL
project (above). The database is being designed carefully
so that it stores all the information needed
to support quality control and expression data analysis. The
system can be extended with additional tools and links and
can also be generalised for use with other organisms.
Session
on Bioinformatics
Robert
Stevens (Manchester University, UK) presented a talk on
ontologies, which are increasingly used within the
world of genomics and functional genomics. He presented the
components of an ontology and the process of building one
in order to capture knowledge about a domain.
Paul
Van der Vet (University of Twente, Netherlands) talked
about C2M, a chemical
configurable middleware. The current difficulty experienced
in exchanging information from heterogeneous resources is
related to format multiplicity. Ideally, users should have
several tools at their disposal: a wrapper (data converter)
generator that uses a high-level format description, a code
generator that turns these descriptions into appropriate
program code, a compiler, and a documenter that turns these
descriptions into human-readable documentation. C2M is a
prototype that aims to meet these required specifications.
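The idea of deriving both a wrapper and its documentation from a single high-level format description can be sketched as follows. This is not C2M: the description layout, the field names and the functions `make_wrapper` and `make_documentation` are hypothetical, and the sketch builds the wrapper at run time instead of generating and compiling program code.

```python
# Hypothetical high-level description of a tab-delimited spot table.
FORMAT = {
    "name": "spot_table",
    "delimiter": "\t",
    "fields": [
        ("spot_id", str, "Identifier of the gel spot"),
        ("pi", float, "Isoelectric point"),
        ("mw_kda", float, "Molecular weight in kDa"),
    ],
}


def make_wrapper(description):
    """Build a function that converts one text line into a typed record."""
    delimiter = description["delimiter"]
    fields = description["fields"]

    def wrapper(line):
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) != len(fields):
            raise ValueError(
                f"expected {len(fields)} fields, found {len(parts)}"
            )
        return {
            name: convert(raw)
            for (name, convert, _), raw in zip(fields, parts)
        }

    return wrapper


def make_documentation(description):
    """Produce human-readable documentation from the same description."""
    lines = [
        f"Format {description['name']}: one record per line, "
        f"fields separated by {description['delimiter']!r}"
    ]
    for position, (name, convert, text) in enumerate(
        description["fields"], start=1
    ):
        lines.append(f"  {position}. {name} ({convert.__name__}): {text}")
    return "\n".join(lines)


parse_spot = make_wrapper(FORMAT)
record = parse_spot("S001\t5.4\t42.0\n")
doc = make_documentation(FORMAT)
```

Because the parser and the documentation come from the same description, a change of format needs to be made in only one place.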
Anne
Morgat (INRIA, France) introduced Panoramix,
a 2-year project aimed at federating a set of knowledge bases
dedicated to microbial genome annotation. From 49 known microbial
genomes, two have been chosen as reference organisms. Four
activities are linked to Panoramix:
Genomix (genomic information from genome sequences),
Proteix (protein information, based on extraction from
SWISS-PROT), Organix (organic
compounds involved in biological processes) and Metabolix,
dealing with the description of metabolic pathways. Metabolix
faces challenges associated with heterogeneous user points
of view (e.g. chemists see chemical compounds, biochemists
see enzymes, geneticists see genes encoding proteins, computer
scientists see graphs encoding reactions). Anne highlighted
the difficulty of integrating inconsistent and incomplete
data with existing generalised and specialised databases.
An example is the comparison of chemical compound names as
CAS numbers and/or structural representations.
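As an illustration of the compound comparison problem, the sketch below matches records from two sources on their CAS numbers rather than on names, after checking each number against the CAS check-digit rule (the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10). The record lists and the function `match_by_cas` are hypothetical and are not part of Panoramix.

```python
import re

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Check the layout and the check digit of a CAS number."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied starting from the rightmost digit.
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))


def match_by_cas(records_a, records_b):
    """Pair compound names from two sources that share a valid CAS number.

    Each record is a (name, cas_number) tuple.
    """
    index_b = {cas: name for name, cas in records_b if is_valid_cas(cas)}
    return [
        (name_a, index_b[cas])
        for name_a, cas in records_a
        if is_valid_cas(cas) and cas in index_b
    ]


# Hypothetical records from two databases that name the same compounds differently.
source_a = [("water", "7732-18-5"), ("ethanol", "64-17-5")]
source_b = [
    ("ethyl alcohol", "64-17-5"),
    ("oxidane", "7732-18-5"),
    ("mistyped entry", "64-17-6"),
]
pairs = match_by_cas(source_a, source_b)
```

Matching on a validated identifier avoids the synonym problem, but it only helps for compounds that carry a CAS number in both sources; the remainder would still need comparison by structure or manual curation.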
Denis
Shields (Royal College of Surgeons, Ireland) presented
an approach for integrating genotyping data. He talked about
the EU prospective cardiology project,
MORGAM, and the World Health Organisation's MONICA
project (http://www.ktl.fi/monica/index.html).
The idea of the bioinformatics activity here is to integrate
genetic variation information coming from phenotypic, metabolic,
proteomic, transcriptomic and genetic data. The principle
of the planned architecture is to use a Laboratory Information
Management System (LIMS) to handle raw data and to centralise
them in a database for later use by statistical and data mining tools.
The design of the centralised database is still not finalised
and will depend on the type of data generated.
This
conference report and selected papers from the workshop are
published in the journal Comparative
and Functional Genomics, issue 3(1), Jan-Feb 2002, Copyright
© 2002 John Wiley & Sons, Ltd.
List
of participants ...
| Name | First name | Affiliation |
| Appel | Ron D. | Swiss Institute of Bioinformatics, CMU, 1 rue Michel-Servet, 1211 Geneva 4, Switzerland |
| Azuaje | Francisco | Trinity College Dublin, Ireland |
| Bairoch | Amos | Swiss Institute of Bioinformatics, CMU, 1 rue Michel-Servet, 1211 Geneva 4, Switzerland |
| Bernardi | Luca | European Media Laboratory GmbH, Villa Bosch, Schloss-Wolfsbrunnenweg 33, D-69118 Heidelberg, Germany |
| Bessières | Philippe | Mathématique, Informatique et Genome (MIG), INRA - Route de St-Cyr, 78026 Versailles Cedex, France |
| Binz | Pierre-Alain | Swiss Institute of Bioinformatics, CMU, 1 rue Michel-Servet, 1211 Geneva 4, Switzerland |
| Conesa | Anna | TNO Nutrition and Food Research, The Netherlands |
| de Daruvar | Antoine | LION Bioscience, Bordeaux, France |
| Edwards | Clive | Molecular and Environmental Microbiology Research Group, School of Biological Sciences, University of Liverpool, Liverpool L69 7ZB, UK |
| Glaser | Philippe | Institut Pasteur, Paris, France |
| Harwood | Colin | Department of Microbiology and Immunology, Medical School, Newcastle University, Newcastle upon Tyne NE2 4HH, UK |
| Hermjakob | Henning | European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK |
| Hinnerk | Boriss | Aarhus University, Department of Genetics and Ecology |
| Hochstrasser | Denis F. | Laboratoire central de chimie clinique, Geneva University Hospital, 1 rue Micheli-du-Crest, Switzerland |
| Jungblut | Peter R. | Max-Planck Institute for Infection Biology, Berlin |
| Kaerst | Uwe | Dept. Cell Biology, GBF - Gesellschaft fuer Biotechnologische Forschung, Mascheroder Weg 1, D-38124 Braunschweig, Germany |
| Mangold | Véronique | Swiss Institute of Bioinformatics, CMU, 1 rue Michel-Servet, 1211 Geneva 4, Switzerland |
| Marsh | Joan | John Wiley & Sons, Ealing Broadway, London W5 5DB, UK |
| Martin | Annette | The Babraham Institute, Cambridge, CB2 4AT, UK |
| Morgat | Anne | Inria Rhone-Alpes, 655, avenue de l'Europe, 38330 Montbonnot-Saint Martin, France |
| Moszer | Ivan | Institut Pasteur, Unité de Génétique des Génomes Bacteriens, 28 rue du Docteur Roux, 75724 Paris Cedex 15, France |
| Paz Ares | Javier | Department of Plant Molecular Genetics, Centro Nacional de Biotecnologia, Campus de Cantoblanco, 28049 Madrid, Spain |
| Quadroni | Manfredo | Institute of Biochemistry F408, University of Lausanne, Ch. des Boveresses 155, 1066 Epalinges, Switzerland |
| Rice | Peter | LION Bioscience, Great Britain |
| Shields | Denis | Royal College of Surgeons in Ireland |
| Stevens | Robert | TAMBIS project, Manchester |
| Stiekema | Wilhelm J. | Plant Research International, Postbus 16, 6700 AA Wageningen, The Netherlands |
| Taussig | Mike | The Babraham Institute, Cambridge, CB2 4AT, UK |
| Trajanoski | Zlatko | Institute of Biomedical Engineering, Krenngasse 37, 8010 Graz, Austria |
| Valencia | Alfonso | Protein Design Group, CNB-CSIC Centro Nacional de Biotecnologia, Cantoblanco Madrid 28049, Spain |
| van der Vet | Paul | Dept. of Computer Science, University of Twente |
| Vandekerkhove | Joel | Flanders Interuniversity Institute for Biotechnology and Department of Biochemistry, Faculty of Medicine, Ghent University, Ghent, Belgium |
Some
candid snaps from the meeting ...
The
organisers would like to thank Vanessa Lenglart (seen above,
bottom right) for a lovely evening of entertainment on the
piano.