The ESF Programme on Functional Genomics workshop on
"Data Integration in Functional Genomics: Application to Biological Pathways"

Swiss Institute of Bioinformatics, Geneva, Switzerland, 22-24 September 2003

Organisers

Pierre-Alain Binz Swiss Institute of Bioinformatics, Geneva, Switzerland and Department of Biological Sciences and Bioinformatics, University of Geneva, Geneva, Switzerland
Henning Hermjakob European Bioinformatics Institute, Hinxton, UK
Paul Van der Vet Department of Computer Science, University of Twente, Enschede, The Netherlands

Report

Abstract

We report from the second ESF Programme on Functional Genomics workshop on Data Integration, which had sessions on 'Current status of the use of experimental information to create biological pathways databases in existing consortia/projects', 'Pathways as part of bioinformatics infrastructures', 'Design, creation and formalization of biological pathways databases', 'Generating and supporting pathway data', 'Interoperability of databases with other external databases and standards' and 'Future perspectives'. Key issues emerging from the discussions were the need for continued funding to cover maintenance and curation of databases, the importance of quality control of the data in these resources, and how to facilitate the exchange of data and ensure the interoperability of databases.

Introduction

The integration of heterogeneous data and information is a key issue in the field of functional genomics. Currently available technologies are producing floods of results that have to be stored, interpreted, validated and correlated with biological significance. Databases have been created that collect information on protein-protein interaction and biological pathways.
A first workshop on data integration in functional genomics and proteomics was organised in Geneva in October 2001 [1, and http://www.functionalgenomics.org.uk/sections/activitites/Reports/report_geneva_2001.htm]. It was organised within the framework of the European Science Foundation (ESF) Programme on Integrated Approaches for Functional Genomics (http://www.functionalgenomics.org.uk). The goal was to bring together scientists with different backgrounds (biologists, bioinformaticians) who were participating in projects involving or requiring integration of heterogeneous biological data. The theme 'data integration requirements' in the framework of general functional genomics approaches was extensively discussed, with a particular focus on proteomics-related questions (a rapidly evolving area that is a good example of data heterogeneity). The general discussion led, among other suggestions, to the proposal to organise another workshop focusing on a more specific aspect of data integration. The second Geneva workshop, which is reported here and in a number of separate contributions published in this special issue, therefore concentrated on data integration as applied to the description, interpretation and understanding of biological pathways.

The first session was devoted to the current status of the use of experimental information to create biological pathways databases in existing consortia/projects. The second session focused on how activities in biological pathways are implemented in existing or developing bioinformatics infrastructures. The third session discussed the design, creation and formalization of biological pathways databases. The fourth session was entitled "Generating and supporting pathway data" and focused on experimental data that support the interpretation of biological pathways and/or can be used to generate them. The fifth session addressed the technical aspects of database interoperability and the standards required. The last session was dedicated to future perspectives.
The workshop brought together scientists and bioinformaticians who are involved in major multidisciplinary projects as well as in the development of functional genomics databases.

Some of the participants submitted individual material for publication in a special section of the journal Comparative and Functional Genomics (CFG). This report summarizes the main presentation and discussion points and includes some extended abstracts of the presentations.


Sessions and subsequent discussions

Session 1: Current status of the use of experimental information to create biological pathways databases in existing consortia/projects

The session started with a report by Sergio Nasi on the activities of the 2002 ESF workshop on molecular networks. He presented the molecular complexity that proteins display in living systems and described some of the theoretical approaches being developed to represent and model molecular networks. It emerged from the discussion that most of the modelled systems are based on known experimental data and are therefore not yet truly predictive. The need for closer communication between biologists and theoreticians was also mentioned. The ESF report can be found at http://www.functionalgenomics.org.uk/sections/activitites/Reports/report_granada_2002.htm. Sergio Nasi has submitted a review to the special section in CFG that is a useful source of information on available web resources in the field of pathways, including some comments on their efficiency and approaches.

The status of four EU-funded projects since the Geneva ESF workshop in 2001 was then presented:

Uwe Kaerst presented the final status of the EU-funded consortium named REALIS. The project aimed at a postgenomic analysis of the Gram-positive human and animal pathogen Listeria monocytogenes. He presented RibDB, the database created during the project. As for other FP5 EU projects, there is no funding to maintain such databases at the end of the project. The discussion highlighted the need for long-term funding plans for the maintenance and updating of such databases, which are useful to the entire scientific community.

Ramon Alonso Allende discussed the bioinformatics environment of the REGIA (Regulatory Gene Initiative in Arabidopsis) consortium. The consortium members accumulate experimental protein and genome data. This also includes phenotypic analysis of mutant and transgenic species, expression array data and metabolic analyses. The data are made available through integration into the PlaNet network (see below). The difficulty of generating an appropriate database model and design stems from the difficulty of coping with rapidly evolving experimental technologies and of making biologists aware of the constraints faced by database developers.

The PlaNet consortium of European plant databases has been presented by Heiko Schoof and is submitted as an article in the special section of CFG.

Colin Harwood, while presenting the BACELL consortium, highlighted a number of issues that many databases are facing. One is the problem of quality control of the data, which are usually not peer-reviewed. As data originally come from different experimental approaches and technologies, they are sometimes considered good and trustworthy if a correlation exists between various data sources. In many consortia, it is difficult to accommodate experimental data that are provided towards the end of the project. The time available for data mining activities is often reduced to a minimum.

Note: REALIS, REGIA, PlaNet, and BACELL are funded under the Fifth Framework Programme by the Quality of Life and Management of Living Resources Programme (http://europa.eu.int/comm/research/quality-of-life.html) of the European Commission. Description of the projects can be found on the Community R&D Information Service (CORDIS) server (http://www.cordis.lu/).

Session 2: Pathways as part of bioinformatics infrastructures

Paolo Romano discussed the issues of creating a network of biological resources such as EBRCN (European Biological Resources Centers Network) (http://www.ebrcn.org). To combine persistence of the information with heterogeneity of the data, formal descriptions of the links between databases or data sources are necessary. EBRCN has a follow-up in the CABRI (Common Access to Biological Resources and Information) (http://www.cabri.org) EU project. CABRI is presented in a separate contribution in the special section of CFG.

The development of LIMaS (LIMS for arrays) at the MRC in the UK (http://www.mgu.har.mrc.ac.uk/microarray/limas/) was presented by Sarah Webb. This system has taken into account standardization for file and information exchange. Its development followed the MIAME recommendations (http://www.mged.org/Workgroups/MIAME/miame.html). PEDRO (http://pedro.man.ac.uk) is also being considered for the handling of proteomics data. An extended abstract of the presentation is attached to the report.

The Data Integration and Analysis for Medical Systems Biology (DIAL) approach adopted in the Netherlands was introduced by Johannes Van Beek. A contribution describes DIAL in detail in the special section in CFG.

Anne Morgat discussed ways to represent pathways from different points of view. One common approach is to describe them in terms of components and relationships. The issues of interoperability of databases and of methods were illustrated with the Genostar (http://www.genostar.org) environment and the GenoExpertBacteria (http://www-geb.inrialpes.fr) project, which uses the ENZYME and KEGG databases and the High-quality Automated and Manual Annotation of microbial Proteomes (HAMAP) project developed at the Swiss Institute of Bioinformatics.


Session 3: Design, creation and formalisation of biological pathways databases

The environment of the Arabidopsis Genome Database was discussed by Heiko Schoof. He started with a list of functional requirements that should be considered when starting a database integration project. He presented the MIPS Arabidopsis thaliana DataBase (MAtDB) (http://www.mips.biochem.mpg.de/proj/thal/db/), a federated database that makes available, through a common interface, genomic information from various databases. MetaMIPS attempts to build metabolic pathways. He also mentioned the Genome Research Environment (GenRE) project as a flexible workhorse for the annotation of genome information. All genomes being annotated by the MIPS group will move to GenRE (http://mips.gsf.de/genre/proj/genre/), which will allow for comprehensive annotation of complex genomic features. The discussion concerned the quality of the data made available. There seems to be a need for quality criteria before data can be deposited in public databases.

Kristian Axelsen described ENZYME (http://us.expasy.org/enzyme/), a repository of information on the nomenclature of enzymes based on the IUBMB (International Union of Biochemistry and Molecular Biology) (http://www.iubmb.unibe.ch/) recommendations (http://www.chem.qmul.ac.uk/iubmb/). The CAS registry numbers are also considered for the description of the chemicals involved. He pointed out that the process of attributing new classifications is currently slow. He discussed some open issues, such as the missing link to systematic information on pathways and the difficulty of controlling the accurate propagation of corrections or updates of EC numbers and of general information. As the reactivity of an enzyme depends on its biological environment and thermodynamic conditions, the experimental conditions should be prominently present in the description of an enzyme activity. Kristian also introduced the IntEnz project (http://www.ebi.ac.uk/intenz/index.html), which aims to act as a central repository for enzyme nomenclature. IntEnz will integrate information from ENZYME, BRENDA, IUBMB

Dietmar Schomburg presented BRENDA (http://www.brenda.uni-koeln.de/), a well-established comprehensive Enzyme Information System. It contains information on enzyme reaction and specificity, structure, function, isolation procedure, stability, taxonomic occurrence, etc. The data, extracted from original literature, is manually evaluated by scientists, thus reducing the amount of low quality data.

Claudia Choi introduced TRANSPATH® (http://www.biobase.de/pages/products/transpath.html), a signal transduction pathway database, tightly linked to TRANSFAC® (http://www.biobase.de/pages/products/transfac.html), a transcription factor database. The information from TRANSPATH® can be visualised using PathwayBuilder™. Claudia Choi has submitted an article to the CFG special section.

Ulrike Wittig discussed the EML project to develop an integrated database system for computational analysis and visualisation of biochemical pathways. In that respect, she described a classification system of chemical compounds to support complex queries in pathway databases. She also presented the first version of BioBrowser, a tool that queries chemical compounds by class and subclass types. For more details, see her paper in the CFG special section.


Session 4: Generating and supporting pathway data

Ioannis Xenarios presented DIP, a protein-protein interaction database (http://dip.doe-mbi.ucla.edu/). He punctuated his presentation with warnings, such as one about the danger of interpreting protein-protein interactions measured in screening experiments as real interactions in vivo. Small changes in the experimental conditions might drastically change the presence or absence of measured interactions. Furthermore, the definition of a protein-protein interaction might differ according to the context. A protein described in Swiss-Prot and considered only by its primary structure might not correspond to the real interacting partner in an experiment. He also highlighted the general lack of comment in databases on the reliability of experimentally measured interactions.

The future of Swiss-Prot (http://www.expasy.org/sprot) and TrEMBL (http://www.ebi.ac.uk/trembl/index.html), which are becoming part of the UniProt project (http://www.expasy.org/uniprot/), was introduced by Amos Bairoch. The database has reached one million entries and should double in size within 2-3 years. This is due to the large number of bacterial genomes being sequenced (about one a week). It is estimated that half of the Swiss-Prot entries are enzymes. The information concerning the activity, structure, membership of a pathway, etc. can be found in various fields of each entry. With respect to pathways, an effort is being made to move away from the free-text description format towards a standardised representation, as well as to add references pointing to pathway databases.

Djamel Medjahed presented a system to generate virtual 2D maps of proteins and to look for cancer clues from transformation of experiments on microdissected tissues (TMAP for Tissue Molecular Anatomy Project). See his paper in the CFG special section for more details.


Session 5: Interoperability of databases with other external databases and standards

Philipp Reiser approached gene function by performing reverse engineering of metabolic pathways in yeast. Starting from information in KEGG (http://www.genome.ad.jp/kegg/) and IUPAC chemical nomenclature (http://www.chem.qmw.ac.uk/iupac/), he modelled biochemical reactions and built relations between genes and pathways. This method should make it possible to browse the relationship between nutrients and biochemically synthesized compounds. He also described the Robot Scientist, a system that uses machine learning to help design experiments and manage a liquid-handling system.
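
To illustrate the idea of linking nutrients, reactions and genes, the following is a minimal sketch in Python. It is not the method presented at the workshop: the `Reaction` record, the function name and the toy compound and gene labels are all hypothetical. It simply assumes that each reaction is described by its substrates, its products and the gene encoding the catalysing enzyme, and forward-chains from a set of nutrients to the compounds that can be synthesized.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    gene: str              # hypothetical label of the gene encoding the enzyme
    substrates: frozenset  # compounds consumed
    products: frozenset    # compounds produced


def reachable_compounds(nutrients, reactions):
    """Return (compounds, genes) reachable from the given nutrients.

    A reaction fires once all of its substrates are available; firing it
    makes its products available to other reactions.
    """
    available = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for index, reaction in enumerate(reactions):
            if index not in fired and reaction.substrates <= available:
                fired.add(index)
                available |= reaction.products
                changed = True
    return available, {reactions[i].gene for i in fired}


# Toy network with made-up names (an assumption, not real pathway data).
toy_reactions = [
    Reaction("geneA", frozenset({"nutrient1"}), frozenset({"compound1"})),
    Reaction("geneB", frozenset({"compound1"}), frozenset({"compound2"})),
    Reaction("geneC", frozenset({"compound2", "cofactorX"}), frozenset({"compound3"})),
]

compounds, genes = reachable_compounds({"nutrient1"}, toy_reactions)
```

In this toy network, `compound3` is not reachable because `cofactorX` is never supplied, so `geneC` is not associated with growth on `nutrient1` alone.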

A method to extract bioentities and relationships from Medline abstracts was presented by Christian Blaschke. He pointed out the difficulty of extracting consistent information from text. Gene nomenclatures are variable and confusing: different genes share homonymous abbreviations, different scientific communities use different words for the same bioentity, and gene names are often nested within longer terms. Ways of looking for sets of terms in sentences, rather than single terms only, were discussed.
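
The nomenclature problems listed above (homonymous abbreviations and nested names) can be made concrete with a small dictionary-based sketch in Python. This is not the system presented at the workshop; the lexicon, the identifiers and the function names are hypothetical. It assumes a lexicon mapping surface forms to sets of gene identifiers, prefers the longest match so that a nested name is not reported in place of the longer term, and flags a mention as ambiguous when it maps to more than one identifier.

```python
import re


def build_matcher(lexicon):
    """Compile a regex that tries longer terms first (handles nested names)."""
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_mentions(sentence, lexicon):
    """Return (surface form, sorted identifiers, is_ambiguous) per mention."""
    matcher = build_matcher(lexicon)
    by_lowercase = {term.lower(): ids for term, ids in lexicon.items()}
    mentions = []
    for match in matcher.finditer(sentence):
        ids = by_lowercase[match.group(0).lower()]
        mentions.append((match.group(0), sorted(ids), len(ids) > 1))
    return mentions


# Hypothetical lexicon: "fk" is a homonymous abbreviation for two genes,
# and "foo1" is nested inside the longer name "foo1 kinase".
toy_lexicon = {
    "foo1": {"GENE:1"},
    "foo1 kinase": {"GENE:2"},
    "fk": {"GENE:2", "GENE:3"},
}

mentions = find_mentions("FK binds foo1 kinase but not foo1.", toy_lexicon)
```

Resolving an ambiguous mention would need context from the rest of the sentence or abstract, which this sketch does not attempt.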

As Henning Hermjakob was unfortunately ill, Ioannis Xenarios talked about the Proteomics Standards Initiative (PSI) (http://psidev.sourceforge.net). PSI is a bioinformatics initiative of the Human Proteome Organisation (HUPO) that develops standards for data formats, representation, exchange and annotation in proteomics. In collaboration with the MGED consortium (http://www.mged.org) and the American Society for Testing and Materials (ASTM) (http://webstore.ansi.org), it proposes recommendations and an XML format for data exchange. It orients its activities along three complementary axes: protein-protein interaction, mass spectrometry and general proteomics.

Frederique Lisacek presented an innovative approach to shaping biological knowledge. She described environments for improving protein annotation by combining information from various complementary sources, i.e. databases, prediction tools and human expert input. More details can be found in her paper in this section of CFG.


Session 6: Perspectives, general discussion

A general discussion took place at the end of the workshop. The topics were chosen according to the discussion points debated already in the first five sessions.

Sustainability of databases
The difficulty of finding funds to maintain and update databases that were financed by EU projects and other time-limited projects was raised in the previous Geneva ESF workshop on data integration in 2001 and confirmed again this year. According to comments made in the discussion, UK grants stipulate that databases will be maintained, but it is not explicitly stated where the funds will come from. On the other hand, it seems that UK employment legislation will affect the career structure of post-docs. It is hoped that this might result in more scientific officers who could manage such databases. The question of the role of the European Bioinformatics Institute (EBI) was raised. As the EBI has a service role of providing access to databases, it could be involved in the design phase of all databases that might, at the end of a time-limited funded project, be proposed for hosting on its site. This implies the description and development of guidelines that include the compatibility conditions these databases must fulfil. Another approach is to propose that the interested scientific communities or even societies should apply for and obtain money to maintain these databases themselves. Perhaps an EU Network of Excellence should be created to address the sustainability of biological databases.

Quality of information in databases, propagation of errors
During this workshop, questions about the quality of information were often put to database developers and providers. Many databases lack references to the experimental sources of the information, which hinders appropriate tracking and interpretation of the biological quality. Another issue is that the information provided is not always linked to a description of how the data have been manipulated, integrated and interpreted between the original source (experiment or original publication) and the final state of the database content. Data quality and the degree of confidence are linked to the fact that end-users and database providers are accustomed to citing the limitations and strengths of specific technologies. It does not always seem mandatory for databases to contain a biological interpretation of deposited data. However, databases should clearly state the experimental or literature evidence for each data item. Proposals were made for the organization of a future workshop that would specifically address the question of quality in databases.
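
As a minimal sketch of what "evidence for each data item" could look like in practice, the following Python fragment attaches evidence and processing history to each entry and lists the entries that lack any evidence. The class and field names are hypothetical and do not correspond to any database discussed at the workshop.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    kind: str       # assumed vocabulary: "experiment" or "literature"
    reference: str  # e.g. an experiment identifier or a citation


@dataclass
class Entry:
    statement: str                                         # the deposited data item
    evidence: list = field(default_factory=list)           # supporting Evidence records
    processing_steps: list = field(default_factory=list)   # how the data were transformed


def entries_lacking_evidence(entries):
    """Return the statements that cite no experimental or literature evidence."""
    return [entry.statement for entry in entries if not entry.evidence]


# Toy entries with made-up content (an assumption for illustration only).
toy_entries = [
    Entry(
        "protein P1 interacts with protein P2",
        evidence=[Evidence("experiment", "screen-001")],
        processing_steps=["raw screen result", "curated manually"],
    ),
    Entry("protein P3 belongs to pathway W"),
]

unsupported = entries_lacking_evidence(toy_entries)
```

Recording the processing steps alongside the evidence is one way to keep the path from the original source to the final database content traceable.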


Conclusions

The participants found the workshop interesting and fruitful, and felt that it should be continued. The presentations were generally the starting point of many discussions. Questions related to the quality and sustainability of the information provided by consortia or databases arose regularly. It was clear that this theme has to be addressed in greater depth in a future workshop. Technical and functional solutions for facilitating the exchange of information between data sources have to be proposed. The MGED consortium and the PSI initiative are good models for the standardization of data and information representation in biological pathways databases. The funding agencies, particularly the EU, should be approached and made aware of the scientific community's need for funding to find solutions to these issues. A decision to submit a proposal for an ESF workshop that addresses the quality issues has been approved. Tentatively, Martin Hofmann, Paul van der Vet and Pierre-Alain Binz will be in charge of this proposal.


Acknowledgements

We would like to thank Mike Taussig and Annette Martin of the European Science Foundation (ESF) Programme on Integrated Approaches for Functional Genomics for their support in the organisation. Thanks also to Jazztime, the Dixieland band that entertained the participants on Lake Geneva during their trip to the venue of the social dinner, the restaurant Creux-de-Genthoz. Thanks also to Joan Marsh, who took notes during the workshop. Finally, we are deeply thankful to Laure, Claudia, Dolnide and Veronique for the local logistic organisation.


[1] Binz P.-A., Martin A., Taussig M., de Daruvar A. The ESF Programme on Integrated Approaches for Functional Genomics. Workshop on "Data Integration in Functional Genomics and Proteomics" (2002) Comp Funct Genom 3: 16-21


Abstracts

Uwe Kaerst

The REALIS project aimed at a postgenomic analysis of the Gram-positive human and animal pathogen Listeria monocytogenes. The scientific objectives were (i) to study the evolution of a pathogenic organism by comparative genomics of L. monocytogenes, L. innocua and clonally successful pathovariants, (ii) the development of postgenomic strategies to provide a complete picture of the remarkable adaptive abilities of this food-borne pathogen, (iii) the understanding of the molecular mechanisms by which environmental clues are perceived and translated into adaptive responses and (iv) the establishment of an integrated bioinformatic database incorporating information on biological pathways [1].
A central research area was the identification of regulatory networks in Listeria. This work primarily focused on virulence-associated proteins, starting with the known virulence gene cluster and its central transcriptional regulator, PrfA. As in silico analyses indicated a significantly larger extent of this regulatory unit, transcriptomics and proteomics analyses were carried out to define the complete PrfA regulon. By comparing several mutants and growth conditions, three groups of genes could be defined that responded, e.g., to the amount of PrfA; there was, however, only a partial overlap in the genes or proteins identified by both strategies. Furthermore, several genes were obviously only indirectly regulated by PrfA, as no specific promoter could be detected upstream of these genes. A large number of these genes are very likely controlled by the alternative sigma factor B, the general stress response regulator, which was shown to interact with the regulation of PrfA itself. Transcriptomics and proteomics analyses of the SigB regulon identified 106 genes controlled by SigB, one of which with no SigB-specific promoter and only partially overlapping with the data obtained for the PrfA regulon. Therefore, this analysis was extended to include SigmaECF, which is active under microaerophilic growth conditions. However, none of the genes identified in a transcriptome analysis of this regulon were observed to be under the control of PrfA as well. A quite unexpected extension of this list arose from the analysis of agl, an agr-like locus involved in quorum sensing. The loss of agl impairs invasiveness and intracellular proliferation, and again a transcriptome analysis revealed that a number of known virulence factors are part of the Agl regulon. These data indicate that a wide regulatory network exists for the regulation of virulence, with obviously many indirect and currently unknown connections.
Sequence and expression data were collected in a central database, RibDB, that contains the data generated by the REALIS consortium.

[1] U. Kärst and the REALIS Consortium. REALIS: Postgenomic analysis of Listeria monocytogenes. Comp Funct Genom 3: 32-34 (2002)


Sarah Webb
Integrating data across platforms
Dr Sarah Webb
Oxford University Begbroke Science Park, Centre for Ecology and Hydrology, Mansfield Road, Oxford OX1 3SR

Since the advent of genomic and proteomic technologies, the volume of data generated by experimentalists has grown exponentially, resulting in a continual need to develop data storage and integration solutions. Genomic data is extremely complex in structure, and much of our data is difficult to represent properly even in complex relational formats, as there is a lack of recognised data standards and of specialist tools to manipulate and view it.
The Environmental Genomics Thematic Programme Data Centre aims to serve as a local repository for data, formatting it according to international standards and submitting it to the appropriate public databases. To achieve this we need:
• Understanding and agreement of what data and annotations should be provided.
• Standard format of data exchange.
• Development of standard vocabularies and ontologies for describing microarray experiments and samples (applicable to proteomics data too).
• Development of standard protocols, reference samples, controls and data normalisation methods.

The mission of this thematic programme is to use existing & emerging genomic/proteomic knowledge & technology to gain a better understanding of ecosystem structure & function. This is being implemented through the funding of projects that address fundamental ecological and evolutionary questions in environmentally important organisms ranging from microbes to vertebrates, in a proactive data management initiative that combines open-source and commercial bioinformatics solutions for analysing, storing, distributing, and mining genomic data. Repositories will be implemented that capture data from sources such as LIMaS, a LIMS tool designed specifically for arrays that supports MIAME, for ESTs, microarrays, proteomics, etc., which will be tied together by a shared set of metadata. To facilitate this, we will provide suitable software for genomic analysis directly through the development of Bio-Linux, a customised version of Linux for bioinformatics research. Bio-Linux is downloadable over the Internet. http://ivgfs.nox.ac.uk/

Heiko Schoof

MIPS participated in the first analysis of the completed Arabidopsis genome and maintains an online database for that data, MAtDB (http://mips.gsf.de/proj/thal/db). This has been the model dataset and backbone for integration of data from various plant species. In our view, a genome is just a parts list. Once the set of genes is available, their interactions and regulations are what defines a plant's lifestyle.
Two major challenges have to be faced in order to efficiently exploit the plethora of data. On the one hand, heterogeneous data must be conveniently available in an integrated, comprehensive yet easy-to-access genome knowledge resource. This involves keeping data current, designing data models that can evolve with new data, simple integration of external data, comprehensive views and simple access for humans and applications. On the other hand, analysis methods are required to discover new knowledge from the data.
MAtDB implements automatic update procedures to harvest new data from public databases, while ensuring high data quality by manual curation of inconsistencies. Flexible data modelling based on XML transformations and modular architecture try to address the problem of keeping pace with evolving data. To integrate within a federated system that also allows application-level access, MAtDB implements BioMOBY (http://www.biomoby.org) compliant web services. This is also the basis for building an integrated biological knowledge resource for plant genome data within the European PlaNet project (http://www.eu-plant-genome.net).

References:
1. Schoof H. (2003) Towards interoperability in genome databases: the MAtDB (MIPS Arabidopsis thaliana database) experience Comp Funct Genom 4: 255-258
2. Schoof,H., Zaccaria,P., Gundlach,H., Lemcke,K., Rudd,S., Kolesov,G., Arnold,R., Mewes,H.W. and Mayer,K.F.X. (2002) MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource based on the first complete plant genome. Nucleic Acids Res., 30, 91-93.
3. Schoof,H., Ernst,R., Nazarov,V., Pfeifer,L., Mewes,H.W. and Mayer,K.F.X. (2004) MIPS Arabidopsis thaliana Database (MAtDB): An integrated biological knowledge resource for plant genomics. Nucleic Acids Res., submitted
4. Schoof H, Ernst R, Mayer KFX. (2004) Comp Funct Genom 5: 184-189