The
ESF Programme on Functional Genomics workshop on
"Data Integration in Functional Genomics: Application
to Biological Pathways"
Swiss Institute of Bioinformatics, Geneva,
Switzerland, 22-24 September 2003
Organisers
Pierre-Alain Binz, Swiss Institute of Bioinformatics, Geneva, Switzerland and Department of Biological Sciences and Bioinformatics, University of Geneva, Geneva, Switzerland
Henning Hermjakob, European Bioinformatics Institute, Hinxton, UK
Paul Van der Vet, Department of Computer Science, University of Twente, Enschede, The Netherlands
Report
Abstract
We
report from the second ESF Programme on Functional Genomics
workshop on Data Integration, which had sessions on 'Current
status of the use of experimental information to create biological
pathways databases in existing consortia/projects', 'Pathways
as part of bioinformatics infrastructures', 'Design, creation
and formalization of biological pathways databases', 'Generating
and supporting pathway data', 'Interoperability of databases
with other external databases and standards' and 'Future perspectives'.
Key issues emerging from the discussions were the need for
continued funding to cover maintenance and curation of databases,
the importance of quality control of the data in these resources,
and how to facilitate the exchange of data and ensure the
interoperability of databases.
Introduction
The
integration of heterogeneous data and information is a key
issue in the field of functional genomics. Currently available
technologies are producing floods of results that have to
be stored, interpreted, validated and correlated with biological
significance. Databases have been created that collect information
on protein-protein interaction and biological pathways.
A first workshop on data integration in functional genomics
and proteomics was organised in Geneva in October 2001 [1,
and http://www.functionalgenomics.org.uk/sections/activitites/Reports/report_geneva_2001.htm].
It was organised within the framework of the European Science
Foundation (ESF) Programme on Integrated Approaches for Functional
Genomics (http://www.functionalgenomics.org.uk).
The goal was to bring together scientists with different backgrounds
(biologists, bioinformaticians) who were participating in
projects involving or requiring integration of heterogeneous
biological data. The theme 'data integration requirements'
in the framework of general functional genomics approaches
was extensively discussed, with a particular focus on the
proteomics related questions (a rapidly evolving area which
is a good example of data heterogeneity). The general discussion
led, among other propositions, to the proposal to organise
another workshop that would focus on a more specific aspect
of data integration issues. The second Geneva workshop, which
is reported here and in a number of separate contributions
published in this special issue, therefore concentrated on
the topic of data integration applied to the description, interpretation
and understanding of biological pathways.
The
first session was devoted to the current status of the use
of experimental information to create biological pathways
databases in existing consortia/projects. The second session
focused on how activities in biological pathways are implemented
in existing or developing bioinformatics infrastructures.
The third session discussed the design, creation and formalization
of biological pathways databases. The fourth session was entitled
"Generating and supporting pathway data" and focused
on experimental data that support interpretation of biological
pathways and/or can be used to generate biological pathways.
The fifth session approached the technical aspects of database
interoperability and required standards. The last session
was dedicated to future perspectives.
The workshop brought together scientists and bioinformaticians
who are involved in major multidisciplinary projects as well
as in the development of functional genomics databases.
Some
of the participants submitted individual material for publication
in a special section of the journal Comparative and Functional
Genomics (CFG). This report summarizes the main presentation
and discussion points and includes some extended abstracts
of the presentations.
Sessions and subsequent discussions
Session
1: Current status of the use of experimental information
to create biological pathways databases in existing consortia/projects
The
session started with a report of the activities of the 2002
ESF workshop on molecular networks provided by Sergio Nasi.
He presented the molecular complexity that proteins display
in living systems. He described some of the theoretical approaches
that are developed to represent and model molecular networks.
It appeared from the discussion that most of the modeled systems
are based on known experimental data and are therefore not
yet truly predictive. The need for closer communication between
biologists and theoreticians was also mentioned. The
ESF report can be found at http://www.functionalgenomics.org.uk/sections/activitites/Reports/report_granada_2002.htm.
Sergio Nasi has submitted a review to the special section
in CFG that provides a useful overview of available
web resources in the field of pathways, including some comments
on their efficiency and approaches.
The
status of four EU-funded projects since the Geneva ESF workshop
in 2001 was then presented:
Uwe
Kaerst presented the final status of the EU-funded consortium
named REALIS. The project aimed at a postgenomic analysis
of the Gram-positive human and animal pathogen Listeria monocytogenes.
He presented RibDB, the database created during the project.
As with other FP5 EU projects, there is no funding to maintain
such databases at the end of the project. The discussion
highlighted the need for long-term funding plans for the maintenance
and updating of such databases, which are useful to the entire scientific
community.
Ramon
Alonso Allende discussed the bioinformatics environment
of the REGIA (Regulatory Gene Initiative in Arabidopsis) consortium.
The consortium members accumulate experimental protein and
genome data. This also includes phenotypic analysis of mutant
and transgenic species, expression array data and metabolic
analyses. The data are made available through integration
into the PlaNet network (see below). The difficulty of generating
an appropriate database model and design is linked to the
difficulty of coping with rapidly evolving experimental technologies
and of making biologists aware of the constraints faced by database
developers.
The
PlaNet consortium of European plant databases was presented
by Heiko Schoof and is described in an article submitted to the special
section of CFG.
Colin
Harwood, while presenting the BACELL consortium, highlighted
a number of issues that many databases are facing. One
is the problem of quality control of the data, which are
usually not peer-reviewed. As data originally come from different
experimental approaches and technologies, they are sometimes
considered good and trustworthy if a correlation exists
between various data sources. In many consortia, it is
difficult to accommodate experimental data
that are provided towards the end of the project. The time
available for data mining activities is often reduced to a
minimum.
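The cross-source heuristic mentioned in the BACELL discussion (trusting an observation more when several independent data sources agree on it) can be sketched as follows. This is only an illustration: the source names, the observations and the threshold of two sources are hypothetical and are not taken from any of the consortia described here.

```python
from collections import defaultdict

def support_by_source(observations):
    """Map each observation to the set of distinct sources that report it."""
    support = defaultdict(set)
    for source, item in observations:
        support[item].add(source)
    return dict(support)

def corroborated(observations, min_sources=2):
    """Return, sorted, the observations reported by at least `min_sources` sources."""
    support = support_by_source(observations)
    return sorted(item for item, sources in support.items()
                  if len(sources) >= min_sources)

# Hypothetical example data: (source, observation) pairs.
observations = [
    ("transcriptomics", "geneA induced under heat shock"),
    ("proteomics", "geneA induced under heat shock"),
    ("transcriptomics", "geneB repressed under heat shock"),
]

print(corroborated(observations))
```

Agreement between sources is only a proxy for quality: two technologies can share the same systematic bias, so such a count complements rather than replaces curation.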
Note:
REALIS, REGIA, PlaNet, and BACELL are funded under the Fifth
Framework Programme by the Quality of Life and Management
of Living Resources Programme (http://europa.eu.int/comm/research/quality-of-life.html)
of the European Commission. Description of the projects can
be found on the Community R&D Information Service (CORDIS)
server (http://www.cordis.lu/).
Session
2: Pathways as part of bioinformatics infrastructures
Paolo
Romano discussed the issues of creating a network of biological
resources such as EBRCN (European Biological Resources Centers
Network) (http://www.ebrcn.org).
To combine persistence of the information with heterogeneity
of the data, formal descriptions of the links between databases
or data sources are necessary. EBRCN has a follow-up in the
CABRI (Common Access to Biological Resources and Information)
(http://www.cabri.org)
EU project. CABRI is presented in a separate contribution
in the special section of CFG.
The
development of LIMaS (LIMS for arrays) at the MRC in the UK (http://www.mgu.har.mrc.ac.uk/microarray/limas/)
was presented by Sarah Webb. The system was designed with standardized
file and information exchange in mind, and its development
followed the MIAME recommendations (http://www.mged.org/Workgroups/MIAME/miame.html).
PEDRO (http://pedro.man.ac.uk)
is also considered for the handling of proteomics data. An
extended abstract of the presentation is attached to the report.
The
Data Integration and Analysis for Medical Systems Biology
(DIAL) approach adopted in the Netherlands was introduced
by Johannes Van Beek. A contribution in the special section
of CFG describes DIAL in detail.
Anne
Morgat discussed ways to represent pathways from different
points of view. One common approach is to describe them in terms of
components and relationships. The issues of interoperability
of databases and of methods were illustrated with the Genostar
(http://www.genostar.org)
environment and the GenoExpertBacteria (http://www-geb.inrialpes.fr)
project, which uses the ENZYME and KEGG databases and the
High-quality Automated and Manual Annotation of microbial
Proteomes (HAMAP) project developed at the Swiss Institute
of Bioinformatics.
Session 3: Design, creation and formalisation of
biological pathways databases
The
environment of the Arabidopsis Genome Database was discussed
by Heiko Schoof. He started with a list of functional requirements
that should be addressed when starting a database integration
project. He presented the MIPS Arabidopsis thaliana DataBase
(MAtDB) (http://www.mips.biochem.mpg.de/proj/thal/db/),
a federated database that makes genomic information from various
databases available through a common interface. MetaMIPS
attempts to build metabolic pathways. He also mentioned the
Genome Research Environment (GenRE) project as a flexible
workhorse for the annotation of genome information. All genomes
being annotated by the MIPS group will move to GenRE (http://mips.gsf.de/genre/proj/genre/),
which will allow for comprehensive annotation of complex genomic
features. The discussion focused on the quality of the data made
available. There seems to be a need for quality criteria before
data can be deposited in public databases.
Kristian
Axelsen described ENZYME (http://us.expasy.org/enzyme/),
a repository of information on nomenclature of enzymes based
on the IUBMB (International Union of Biochemistry and Molecular
Biology) (http://www.iubmb.unibe.ch/)
recommendations (http://www.chem.qmul.ac.uk/iubmb/).
CAS registry numbers are also considered for the description
of the chemicals involved. He pointed out that the process
of attributing new classifications is currently slow. He discussed some open
issues, such as the missing link to systematic information
on pathways and the difficulty of controlling the accurate propagation of
corrections or updates to EC numbers and to general information.
As the reactivity of an enzyme depends on its biological
environment and thermodynamic conditions, the experimental
conditions should be prominently present in the description
of an enzyme activity. Kristian Axelsen also introduced the
IntEnz project (http://www.ebi.ac.uk/intenz/index.html),
which aims to act as a central repository for enzyme nomenclature.
IntEnz will integrate information from ENZYME, BRENDA and IUBMB.
Dietmar
Schomburg presented BRENDA (http://www.brenda.uni-koeln.de/),
a well-established comprehensive Enzyme Information System.
It contains information on enzyme reaction and specificity,
structure, function, isolation procedure, stability, taxonomic
occurrence, etc. The data, extracted from original literature,
is manually evaluated by scientists, thus reducing the amount
of low quality data.
Claudia
Choi introduced TRANSPATH® (http://www.biobase.de/pages/products/transpath.html),
a signal transduction pathway database, tightly linked to
TRANSFAC® (http://www.biobase.de/pages/products/transfac.html),
a transcription factor database. The information from TRANSPATH®
can be visualised using PathwayBuilder. Claudia Choi
has submitted an article to the CFG special section.
Ulrike
Wittig discussed the EML project to develop an integrated
database system for computational analysis and visualisation
of biochemical pathways. In that respect, she described a
classification system of chemical compounds to support complex
queries in pathway databases. She also presented the first
version of BioBrowser, a tool that queries chemical compounds
by class and subclass types. For more details, see her paper
in the CFG special section.
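Querying compounds by class and subclass, as described above, amounts to walking a classification hierarchy. The following Python sketch illustrates the idea only; the class names, the parent links and the function names are hypothetical and do not reflect the actual EML classification or the BioBrowser implementation.

```python
# Hypothetical parent links: each class points to its direct superclass.
PARENT = {
    "hexose": "monosaccharide",
    "monosaccharide": "carbohydrate",
    "amino acid": "organic acid",
}

# Hypothetical assignment of compounds to their most specific class.
COMPOUND_CLASS = {
    "D-glucose": "hexose",
    "D-ribose": "monosaccharide",
    "L-alanine": "amino acid",
}

def ancestors(cls):
    """Yield cls followed by all of its superclasses, most specific first."""
    while cls is not None:
        yield cls
        cls = PARENT.get(cls)

def compounds_in_class(query):
    """Return, sorted, the compounds whose class is `query` or a subclass of it."""
    return sorted(name for name, cls in COMPOUND_CLASS.items()
                  if query in ancestors(cls))

print(compounds_in_class("carbohydrate"))
```

A query on a general class then returns members of all its subclasses, which is what makes class-based queries more expressive than matching compound names directly.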
Session 4: Generating and supporting pathway data
Ioannis
Xenarios presented DIP, a protein-protein interaction database
(http://dip.doe-mbi.ucla.edu/).
He punctuated his presentation with cautionary notes, such
as the danger of interpreting protein-protein
interactions measured in screening experiments as real interactions
in vivo. Small changes in the experimental conditions might
drastically change the presence or absence of measured interactions.
Furthermore, the definition of a protein-protein interaction
might differ according to the context. A protein described
in Swiss-Prot and considered only by its primary structure
might not correspond to the real interacting partner in an
experiment. He also highlighted the general lack of comments
on the reliability of experimentally measured interactions
in databases.
The
future of Swiss-Prot (http://www.expasy.org/sprot)
and TrEMBL (http://www.ebi.ac.uk/trembl/index.html),
which are becoming part of the UniProt project (http://www.expasy.org/uniprot/),
was introduced by Amos Bairoch. The database has reached one million
entries and is expected to double in size within 2-3 years. This is due to the
large number of bacterial genomes being sequenced (about one a week).
It is estimated that half of the Swiss-Prot entries are enzymes.
Information concerning activity, structure, membership
of a pathway, etc. can be found in various fields of each entry.
With respect to pathways, an effort is being made to move away from
free-text descriptions towards a standardized representation,
together with references pointing to pathway databases.
Djamel
Medjahed presented a system to generate virtual 2D maps of
proteins and to look for cancer clues from transformation
of experiments on microdissected tissues (TMAP for Tissue
Molecular Anatomy Project). See his paper in the CFG special
section for more details.
Session 5: Interoperability of databases with other
external databases and standards
Philipp
Reiser approached gene function by performing reverse engineering
of metabolic pathways in yeast. Starting from information
in KEGG (http://www.genome.ad.jp/kegg/)
and IUPAC chemical nomenclature (http://www.chem.qmw.ac.uk/iupac/),
he modeled biochemical reactions and built relations between
genes and pathways. This method should make it possible to explore the
relationship between nutrients and biochemically synthesized
compounds. He also described the Robot Scientist, a system
that uses machine learning to help design experiments and manage
a liquid-handling system.
A
method to extract bioentities and relationships from Medline
abstracts was presented by Christian Blaschke. He pointed
out the difficulty of extracting consistent information from
text. Gene nomenclature is variable and confusing:
different genes share the same abbreviation,
different words are used in different scientific communities
for the same bioentity, and gene names are often represented
as nested terms. Ways of looking for sets of terms in sentences,
rather than single terms, were discussed.
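Two of the points above, homonymous abbreviations and searching for sets of terms within a sentence, can be illustrated with a small Python sketch. The lexicon, the gene identifiers and the function names are hypothetical, and the naive sentence splitting and whole-word matching used here are assumptions made for brevity; they are not the method that was presented.

```python
import re
from itertools import combinations

# Hypothetical lexicon: surface form -> gene identifiers it may denote.
LEXICON = {
    "abc1": {"GENE:0001"},
    "abc transporter 1": {"GENE:0001"},   # synonym of abc1
    "cat": {"GENE:0002", "GENE:0003"},    # homonymous abbreviation
    "xyz": {"GENE:0004"},
}

def ambiguous_names(lexicon):
    """Return, sorted, the names that map to more than one gene."""
    return sorted(name for name, ids in lexicon.items() if len(ids) > 1)

def cooccurring_names(text, lexicon):
    """Return, sorted, the pairs of lexicon names found in the same sentence."""
    pairs = set()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        found = sorted(name for name in lexicon
                       if re.search(r"\b" + re.escape(name) + r"\b", lowered))
        pairs.update(combinations(found, 2))
    return sorted(pairs)

text = "ABC1 interacts with XYZ. CAT is unrelated."
print(ambiguous_names(LEXICON))
print(cooccurring_names(text, LEXICON))
```

Whole-word matching like this would still mishandle nested terms and community-specific synonyms, which is exactly the difficulty raised in the talk.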
As
Henning Hermjakob was unfortunately ill, Ioannis Xenarios
talked about the Proteomics Standards Initiative (PSI) (http://psidev.sourceforge.net).
PSI is a bioinformatics initiative of the Human Proteome Organisation
(HUPO) that develops standards for data formats, representation,
exchange and annotation in proteomics. In collaboration with
the MGED consortium (http://www.mged.org)
and the American Society for Testing and Materials (ASTM)
(http://webstore.ansi.org),
it proposes recommendations and an XML format for data exchange.
It orients its activities along three complementary axes:
protein-protein interaction, mass spectrometry and general
proteomics.
Frederique
Lisacek presented an innovative approach to shaping biological
knowledge. She described environments that improve
protein annotation by combining information from various
complementary sources, i.e. databases, prediction tools and
human expert input. More details can be found in her contribution
to this special section of CFG.
Session 6: Perspectives, general discussion
A
general discussion took place at the end of the workshop.
The topics were chosen according to the points
already debated in the first five sessions.
Sustainability of databases
The difficulty of finding funds to maintain and update databases
that were financed by EU projects and other time-limited
projects was raised at the previous Geneva ESF workshop on
data integration in 2001 and was confirmed again this year. According
to comments made in the discussion, UK grants stipulate that
databases will be maintained, but it is not explicitly stated
where the funds will come from. On the other hand, it seems that
UK employment legislation will affect the career structure of
post-docs. This might, it is hoped, result in more scientific
offices that could manage such databases. The question of
the role of the European Bioinformatics Institute (EBI) was
raised. As EBI has a service role of providing access
to databases, it could be involved in the design phase of
all databases that it might be asked to host at the end of a
time-limited funded project. This
implies the development of guidelines, including
the compatibility conditions that these databases
must fulfil. Another approach is to propose that the interested
scientific communities or even societies should apply for
funding to maintain these databases themselves. Perhaps an
EU Network of Excellence should be created to address
the issue of sustainability of biological databases.
Quality of information in databases, propagation of errors
During this workshop, questions about the quality of information
were often put to database developers and providers.
Many databases lack references to the experimental sources
of the information, which hinders appropriate tracking and interpretation
of the biological quality. Another issue is that
the information provided is not always accompanied by a description
of how the data have been manipulated, integrated and interpreted
between the original experimental source (experiment or original
publication) and the final state of the database content.
Data quality and the degree of confidence are linked to the
limitations and power of specific technologies, which end-users
and database providers are accustomed to citing. It does not always
seem mandatory for databases to contain a biological interpretation of deposited
data. However, databases should clearly mention the experimental
or literature evidence for each data item. Proposals were made for the organization
of a future workshop that should specifically address the
quality question in databases.
Conclusions
The
participants found the workshop interesting, fruitful
and worth continuing. The presentations were generally
the starting point of many discussions. Questions related
to the quality and the sustainability of the information
provided by consortia or databases arose regularly.
It was clear that this theme has to be addressed in
greater depth in a future workshop. Technical and functional solutions
for facilitating the exchange of information between data
sources have to be proposed. The MGED consortium
and the PSI initiative provide good models for the standardization
of data and information representation in biological pathways
databases. The funding agencies, particularly the EU, should
be approached and made aware of the scientific
community's need for access to funding in order to find
solutions to these issues. A decision to submit a proposal
for an ESF workshop that addresses the quality issues
was approved. Tentatively, Martin Hofmann, Paul van der Vet
and Pierre-Alain Binz will be in charge of this proposal.
Acknowledgements
We
would like to thank Mike Taussig and Annette Martin of the
European Science Foundation (ESF) Programme on Integrated
Approaches for Functional Genomics for their support in the
organization. Thanks also to Jazztime, the Dixieland band that entertained
the participants on Lake Geneva during their trip to the
venue of the social dinner, the restaurant Creux-de-Genthoz.
Thanks also to Joan Marsh, who took the notes
during the workshop. Finally, we are deeply thankful to Laure,
Claudia, Dolnide and Veronique for the local logistical organisation.
[1] Binz P.-A., Martin A., Taussig M., de Daruvar A. The ESF
Programme on Integrated Approaches for Functional Genomics.
Workshop on "Data Integration in Functional Genomics
and Proteomics" (2002) Comp Funct Genom 3: 16-21
Abstracts
Uwe
Kaerst
The
REALIS project aimed at a postgenomic analysis of the Gram-positive
human and animal pathogen Listeria monocytogenes. The scientific
objectives were (i) to study the evolution of a pathogenic
organism by comparative genomics of L. monocytogenes, L. innocua
and clonally successful pathovariants, (ii) the development
of postgenomic strategies to provide a complete picture of
the remarkable adaptive abilities of this food-borne pathogen,
(iii) the understanding of the molecular mechanisms by which
environmental cues are perceived and translated into adaptive
responses and (iv) the establishment of an integrated bioinformatic
database incorporating information on biological pathways
[1].
A central research area was the identification of regulatory
networks in Listeria. This work primarily focused on virulence-associated
proteins, starting with the known virulence gene cluster and
its central transcriptional regulator, PrfA. As in silico
analyses indicated a significantly larger extent of this regulatory
unit, transcriptomics and proteomics analyses were carried
out to define the complete PrfA regulon. By comparing several
mutants and growth conditions, three groups of genes could
be defined that responded, e.g., to the amount of PrfA; there
was, however, only a partial overlap in the genes or proteins
identified by both strategies. Furthermore, several genes
were evidently regulated only indirectly by PrfA, as no specific
promoter could be detected upstream of these genes. A large
number of these genes are very likely controlled by the alternative
sigma factor B, the general stress response regulator, which
was shown to interact with the regulation of PrfA itself.
Transcriptomics and proteomics analyses of the SigB regulon
identified 106 genes controlled by SigB, one of which had
no SigB-specific promoter; these genes overlapped only partially with
the data obtained for the PrfA regulon. Therefore, this analysis
was extended to include SigmaECF, which is active under microaerophilic
growth conditions. However, none of the genes identified in
a transcriptome analysis of this regulon were observed to
be under the control of PrfA as well. A quite unexpected extension
of this list arose from the analysis of agl, an agr-like locus
involved in quorum sensing. The loss of agl impairs invasiveness
and intracellular proliferation, and again a transcriptome
analysis revealed that a number of known virulence factors
are part of the Agl regulon. These data indicate that a wide
regulatory network exists for the regulation of virulence,
with evidently many indirect and currently unknown connections.
Sequence and expression data were collected in a central database,
RibDB, that contains the data generated by the REALIS consortium.
[1]
U. Kärst and the REALIS Consortium. REALIS: Postgenomic
analysis of Listeria monocytogenes. Comp Funct Genom 3: 32-34
(2002)
Sarah Webb
Integrating data across platforms
Dr Sarah Webb
Oxford University Begbroke Science Park, Centre for Ecology
and Hydrology, Mansfield Road, Oxford OX1 3SR
Since the advent of genomic and proteomic technologies, the
volume of data being generated by experimentalists has grown
exponentially, resulting in a continual need for the development
of data storage and integration solutions. Genomic data is
extremely complex in structure and much of our data is difficult
to represent properly even in complex relational formats, as
there is a lack of recognised data standards and specialist
tools to manipulate and view it.
The Environmental Genomics Thematic Programme Data Centre
aims to serve as a local repository for data, formatting it
according to international standards and submitting it to the appropriate
public databases. To be able to achieve this we need:
Understanding of and agreement on what data and annotations
should be provided.
A standard format for data exchange.
Development of standard vocabularies and ontologies
for describing microarray experiments and samples (applicable
to proteomics data too).
Development of standard protocols, reference samples,
controls and data normalisation methods.
The
mission of this thematic programme is to use existing and
emerging genomic/proteomic knowledge and technology to gain
a better understanding of ecosystem structure and function.
This is being implemented through the funding of projects
that address fundamental ecological and evolutionary questions
in environmentally important organisms ranging from microbes
to vertebrates, within a proactive data management initiative which
combines open-source and commercial bioinformatics solutions
for analysing, storing, distributing and mining genomic data.
Repositories will be implemented that capture data from sources
such as LIMaS, a LIMS tool designed specifically for arrays
that supports MIAME, for ESTs, microarrays, proteomics,
etc., which will be tied together by a shared set of metadata.
To facilitate this, we will provide suitable software for
genomic analysis directly through the development of Bio-Linux,
a customised version of Linux for bioinformatics research.
Bio-Linux is downloadable over the Internet: http://ivgfs.nox.ac.uk/
Heiko
Schoof
MIPS
participated in the first analysis of the completed Arabidopsis
genome and maintains an online database for that data, MAtDB
(http://mips.gsf.de/proj/thal/db).
This has been the model dataset and backbone for integration
of data from various plant species. In our view, a genome
is just a parts list. Once the set of genes is available,
their interactions and regulation are what define a plant's
lifestyle.
Two major challenges have to be faced in order to efficiently
exploit the plethora of data. On the one hand, heterogeneous
data must be conveniently available in an integrated, comprehensive
yet easy-to-access genome knowledge resource. This involves
keeping data current, designing data models that can evolve
with new data, simple integration of external data, comprehensive
views and simple access for humans and applications. On the
other hand, analysis methods are required to discover new
knowledge from the data.
MAtDB implements automatic update procedures to harvest new
data from public databases, while ensuring high data quality
by manual curation of inconsistencies. Flexible data modelling
based on XML transformations and modular architecture try
to address the problem of keeping pace with evolving data.
To integrate within a federated system that also allows application-level
access, MAtDB implements BioMOBY (http://www.biomoby.org)
compliant web services. This is also the basis for building
an integrated biological knowledge resource for plant genome
data within the European PlaNet project (http://www.eu-plant-genome.net).
References:
1. Schoof H. (2003) Towards interoperability in genome databases:
the MAtDB (MIPS Arabidopsis thaliana database) experience
Comp Funct Genom 4: 255-258
2. Schoof,H., Zaccaria,P., Gundlach,H., Lemcke,K., Rudd,S.,
Kolesov,G.,
Arnold,R., Mewes,H.W. and Mayer,K.F.X. (2002) MIPS Arabidopsis
thaliana Database (MAtDB): an integrated biological knowledge
resource based on the first complete plant genome. Nucleic
Acids Res., 30, 91-93.
3. Schoof,H., Ernst,R., Nazarov,V., Pfeifer,L., Mewes,H.W.
and Mayer,K.F.X. (2004) MIPS Arabidopsis thaliana Database
(MAtDB): An integrated biological knowledge resource for plant
genomics. Nucleic Acids Res., submitted
4. Schoof H, Ernst R, Mayer KFX. (2004) Comp Funct Genom 5:
184-189