2nd BioCreAtIvE Challenge Evaluation Workshop
23-25 April 2007
Madrid, Spain

Organisers
Report
1. Summary
2. Scientific content
3. Discussion at the event
4. Assessment of the results & impact of the event
5. Programme
6. Speakers

Organisers:

Martin Krallinger, CNIO, Madrid, Spain
Alfonso Valencia, CNIO, Madrid, Spain
Lynette Hirschman, MITRE, Bedford, MA, USA

Draft Report

Summary

The growth of scientific literature databases such as PubMed, together with the increasing interest in more efficient information access demanded by the biology community, has resulted in methods that can automatically process collections of biological texts. Text mining aims to efficiently retrieve and classify documents in response to complex user queries and to perform a deeper analysis of the literature to extract specific associations, such as protein-protein interactions and protein annotations.

The goal of this workshop was to drive the development and evaluation of text mining tools applied to biologically relevant tasks in the context of the BioCreAtIvE challenge. BioCreAtIvE is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. BioCreAtIvE arose out of the needs of working biologists, biological curators and bioinformaticians to access the wealth of information in the literature, and to link this information to biological databases and ontologies. BioCreAtIvE focuses on the comparison of methods and community assessment of scientific progress, rather than on the purely competitive aspects. BioCreAtIvE was organized through collaborations between text mining groups, biological database curators and bioinformatics researchers.

The Second BioCreAtIvE Challenge (see biocreative.sourceforge.net) was held during October 2006; the evaluation workshop, sponsored by the ESF, took place in Madrid at the Spanish National Cancer Center (CNIO) from 23 to 25 April 2007. The workshop served to discuss three tracks. The first focused on tools for finding mentions of genes and proteins in sentences drawn from MEDLINE abstracts and was coordinated by John Wilbur (NCBI). The second track consisted of producing a list of the EntrezGene identifiers for all the human genes/proteins mentioned in a collection of MEDLINE abstracts; it is similar to BioCreAtIvE I Task 1B and was coordinated by Lynette Hirschman (MITRE). The third track of BioCreAtIvE II addressed protein interaction detection and was coordinated by the CNIO in collaboration with two of the main protein interaction databases (MINT and IntAct). The BioCreAtIvE protein interaction challenge (Protein-Protein Interaction task) included the detection of articles containing information relevant to experimentally characterized protein interactions, the detection of actual confirmed protein interaction pairs, the extraction of experimental methods used to characterize the interactions, and the corresponding text evidence. The BioCreAtIvE methodology of evaluating text mining systems on real biological problems has attracted the interest of other groups developing curated databases. For example, we supported the OregAnno database developers in organizing a RegCreative annotation jamboree.

The second BioCreAtIvE challenge evaluation workshop, sponsored by the ESF, gave the biomedical text mining and bioinformatics community the opportunity to discuss the results and performance of text mining systems, to determine the state of the art and the most successful methods and algorithms, and to point out future demands, both in terms of biologically interesting tasks and of specific modules that require improvement. The scientific content of the presented talks is described in more detail in the second BioCreAtIvE challenge evaluation workshop proceedings.

Scientific Content

The scientific content of the second BioCreative challenge meeting was structured in terms of the main tasks posed at BioCreative II, namely:
1) The evaluation and analysis of Gene Mention (GM) detection systems and their underlying techniques, where algorithms based on Conditional Random Fields (CRFs) showed especially promising results.
2) The evaluation and analysis of Gene Normalization (GN) systems and their underlying techniques, where flexible dictionary lookup methods in combination with statistical disambiguation techniques provided competitive results.
3) The evaluation and analysis of protein-protein interaction systems and their underlying techniques. Several sub-tasks were covered here: information retrieval and document categorization techniques to correctly classify articles relevant to protein interactions; the extraction of binary protein interaction pairs from the literature, together with the experimental qualifier supporting the interactions; and the extraction of the full-text evidence passage describing a given interaction characterization.
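
To illustrate what "flexible dictionary lookup" in the Gene Normalization task can mean in practice, the following is a minimal sketch, not a reconstruction of any participating system. It assumes a toy lexicon with invented gene names and placeholder identifiers (ID_A, ID_B), and it treats "flexible" as ignoring case, whitespace and punctuation. A mention that maps to more than one identifier is returned as ambiguous; a real system would then apply a disambiguation step, which is not shown here.

```python
import re
from typing import Dict, List, Set


def normalize(name: str) -> str:
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_index(lexicon: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each normalized synonym to the set of identifiers that use it."""
    index: Dict[str, Set[str]] = {}
    for gene_id, synonyms in lexicon.items():
        for synonym in synonyms:
            index.setdefault(normalize(synonym), set()).add(gene_id)
    return index


def lookup(mention: str, index: Dict[str, Set[str]]) -> Set[str]:
    """Return all candidate identifiers for a mention (empty set if none)."""
    return index.get(normalize(mention), set())


# Hypothetical lexicon: names and identifiers are invented for illustration.
TOY_LEXICON = {
    "ID_A": ["FOO1", "foo-1", "Foo protein 1"],
    "ID_B": ["BAR2", "FOO 1"],
}
INDEX = build_index(TOY_LEXICON)

for mention in ["Foo-1", "bar2", "foo PROTEIN 1", "baz"]:
    candidates = sorted(lookup(mention, INDEX))
    print(mention, "->", candidates if candidates else "no match")
```

In this toy example "Foo-1" matches both identifiers, which is exactly the kind of case where a statistical disambiguation step would be needed.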

In addition, prominent researchers in the field gave talks relevant to the topics discussed during the BioCreative event, namely on the TREC Genomics Track, on ongoing initiatives of publishers that may be important for text mining, and on the importance of text mining applications for bioinformatics, biological databases and manual literature annotation efforts.

Discussion at the event

The second BioCreAtIvE evaluation workshop served as a framework to discuss the general characteristics of the most successful text mining techniques and features applied to biologically meaningful tasks, all of which had been evaluated on a common data collection prepared for the BioCreative challenge. Other aspects addressed were meaningful standard evaluation metrics for each particular text mining task, and how combining several methods into a single output (e.g. by majority voting, or by using a machine learning approach to weight each system according to its performance) can yield better predictions. Although combining several system predictions is theoretically interesting, to be of practical interest the systems need to be available online and their predictions have to be directly comparable, e.g. through the use of common prediction formats. The implementation of a BioCreative metaserver system was therefore also discussed during the meeting; it is now available at: bcms.bioinfo.cnio.es. This system is the first metaserver implemented for the biomedical text mining domain, allowing flexible comparison of multiple system predictions as well as programmatic access through XML-RPC. The main limitations of current text mining technologies, and the progress in performance relative to the previous BioCreAtIvE event (BioCreAtIvE I), were discussed in detail. Finally, future initiatives and potential tasks for BioCreAtIvE III were discussed, as well as how the preparation of Gold Standard evaluation data could help to gradually improve the development of robust text mining systems for this domain.
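
The majority-voting combination mentioned above can be sketched as follows. This is an illustrative example under stated assumptions, not the procedure used at the workshop or by the metaserver: each system output is assumed to be a set of (document, identifier) pairs in a common format, the document names and identifiers are invented, and a prediction is kept when more than half of the systems return it unless a different threshold is given.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional, Set


def majority_vote(system_outputs: Iterable[Set[Hashable]],
                  min_votes: Optional[int] = None) -> Set[Hashable]:
    """Keep every prediction returned by at least min_votes systems.

    By default min_votes is a strict majority of the systems.
    """
    outputs = [set(preds) for preds in system_outputs]
    if min_votes is None:
        min_votes = len(outputs) // 2 + 1
    votes: Counter = Counter()
    for preds in outputs:
        votes.update(preds)  # each system votes once per prediction
    return {pred for pred, n in votes.items() if n >= min_votes}


# Hypothetical outputs of three systems, as (document, identifier) pairs.
OUTPUTS = [
    {("doc1", "ID_A"), ("doc1", "ID_B")},
    {("doc1", "ID_A")},
    {("doc1", "ID_A"), ("doc2", "ID_C")},
]
CONSENSUS = majority_vote(OUTPUTS)
print(sorted(CONSENSUS))
```

Lowering min_votes favours recall (with min_votes=1 the result is the union of all predictions), while raising it favours precision. The weighted variant mentioned in the text would replace the unit vote per system with a weight learned from each system's performance.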

Assessment of the results & impact of the event

The second BioCreAtIvE challenge attracted 44 teams from 13 countries worldwide. Each participating system was evaluated on its predictions. Most submissions were evaluated using precision, recall and F-measure (either micro- or macro-averaged). The predictions were evaluated against manually curated annotations produced by domain experts. Inter-annotator agreement studies on the data collections used were also carried out, in order to determine the difficulty of the posed tasks and the consistency of the data. To provide some insight into the methodological aspects of participating systems, invited participants gave a short system description talk during the evaluation workshop and contributed a short technical paper describing their implementation to the BioCreAtIvE workshop proceedings.
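
The difference between micro- and macro-averaged scores can be made concrete with a small sketch. It is not the official BioCreAtIvE evaluation script; it assumes that gold annotations and predictions are given as sets of labels per document (the document names and labels below are invented), and that the macro-averaged F-measure is the mean of the per-document F-measures, which is one of several conventions in use.

```python
from typing import Dict, Set, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def counts(gold: Set[str], pred: Set[str]) -> Tuple[int, int, int]:
    """True positives, false positives and false negatives for one document."""
    return len(gold & pred), len(pred - gold), len(gold - pred)


def micro_average(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]):
    """Pool the counts over all documents, then compute P, R and F once."""
    tp = fp = fn = 0
    for doc in set(gold) | set(pred):
        t, p, n = counts(gold.get(doc, set()), pred.get(doc, set()))
        tp, fp, fn = tp + t, fp + p, fn + n
    return prf(tp, fp, fn)


def macro_average(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]):
    """Compute P, R and F per document, then average each over documents."""
    docs = sorted(set(gold) | set(pred))
    scores = [prf(*counts(gold.get(d, set()), pred.get(d, set()))) for d in docs]
    n = len(scores)
    if n == 0:
        return 0.0, 0.0, 0.0
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Hypothetical data: one long document with errors, one short perfect one.
GOLD = {"d1": {"A", "B", "C", "D"}, "d2": {"E"}}
PRED = {"d1": {"A", "B", "X"}, "d2": {"E"}}

print("micro P/R/F: %.3f %.3f %.3f" % micro_average(GOLD, PRED))
print("macro P/R/F: %.3f %.3f %.3f" % macro_average(GOLD, PRED))
```

In this toy case the micro-averaged scores (0.750, 0.600, 0.667) are lower than the macro-averaged ones (0.833, 0.750, 0.786), because micro-averaging weights every annotation equally and so the longer, error-prone document dominates, whereas macro-averaging weights every document equally.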

The BioCreAtIvE challenge has had a considerable impact on the biomedical text mining community, reflected not only in the number of participating systems, the number of BioCreative data downloads and the subscriptions to the BioCreative mailing list, but also in the scientific publications citing or referring to this event (over 89 publications). A special issue of the journal Genome Biology will also provide a more detailed description of the outcome and data analysis of the Second BioCreAtIvE challenge. This special issue includes an introductory paper on the BioCreAtIvE II event, task evaluation papers, more detailed system descriptions of the most competitive strategies, and an opinion paper gathering input from the most relevant researchers in this domain.

The BioCreAtIvE Metaserver (BCMS), whose implementation was first discussed during the evaluation workshop, has also attracted considerable interest, both in terms of the 13 annotation servers worldwide that currently contribute to this initiative and as an infrastructure to enable structured digital abstracts containing author-generated annotations or protein interactions. One pilot study in which the BCMS has been considered a useful tool is the FEBS Letters experiment on structured digital abstracts.

Programme

Monday - April 23

9:00-10:00 Registration
10:00-10:30 Welcome and Introduction to Second BioCreative Challenge (Alfonso Valencia)

Session I a - GM: Detection and Evaluation of Gene Mentions
Chair: Hagit Shatkay

10:30-11:15 Introduction BioCreative II: Gene mention task (W. John Wilbur)

11:15-11:30 Coffee break / Registration

11:30-11:45 GM: Identifying Gene Mentions by Case-based Classification (Mariana Lara Neves)
11:45-12:00 GM: A Novel Feature Representation: Integrate GM Results into Interaction Abstract Identification (Richard Tsai)
12:00-12:15 GM: Combined Conditional Random Fields and n-Gram Language Models for Gene Mention Recognition (Craig A. Struble)
12:15-12:30 GM: Tackling the BioCreative2 Gene Mention task with Conditional Random Fields and Syntactic Parsing (Andreas Vlachos)
12:30-12:45 GM: Named Entity Recognition with Combinations of Conditional Random Fields (Roman Klinger)
12:45-13:00 GM: Gene Mention Recognition Using Lexicon Match Based Two-Layer Support Vector Machines. (Yifei Chen)

13:00-14:00 Lunch break

Session I b - GM: Detection and Evaluation of Gene Mentions cont.
Chair: John Wilbur

14:00-14:15 GM: Using Semi-Supervised Techniques to Detect Gene Mentions (Sophia Katrenko)
14:15-14:30 GM: BioCreative II Gene Mention Tagging System at IBM Watson (Rie Kubota Ando)
14:30-14:45 GM: Rich Feature Set, Unification of Bidirectional Parsing and Dictionary Filtering for High F-Score Gene Mention Tagging (Cheng-Ju Kuo)
14:45-15:00 GM: High-Recall Gene Mention Recognition by Unification of Multiple Backward Parsing Models (Chun-Nan Hsu)
15:00-15:15 GM: Attribute Analysis in Biomedical Text Classification (Francisco Carrero)
15:15-15:30 GM: Three tricks can lead to a 24% relative reduction in error. (Kuzman Ganchev)

15:30-15:45 Coffee break / Registration

15:45-16:15 Invited speaker Corpus Annotation and Its Use in BioNLP (Junichi Tsujii)
16:15-17:00 Panel discussion 1 Current state and future demands in Bio-NER
17:00-18:00 Poster session 1

Tuesday - April 24

8:45-9:30 Registration

Session II a - GN: Evaluation of Gene Normalization
Chair: Luis Rocha

9:30-10:15 Introduction Overview of BioCreative II Gene Normalization (Lynette Hirschman)
10:15-10:30 GN: Text Detective: Gene/protein annotation tool by Alma Bioinformatics (Christian Blaschke)
10:30-10:45 GN: Peregrine: Lightweight gene name normalization by dictionary lookup (Martijn Schuemie)
10:45-11:00 GN: Gene Normalization Using Machine Learning and Flexible Dictionary Lookup (Hongfang Liu, Manabu Torii)
11:00-11:15 GN: Me and my friends: gene mention normalization with background knowledge (Jörg Hakenberg)

11:15-11:30 Coffee break / Registration

Session II b - GN: Evaluation of Gene Normalization cont.
Chair: Lynette Hirschman

11:30-11:45 GN: Context-Aware Mapping of Gene Names using Trigrams (ThaiBinh Luong)
11:45-12:00 GN: ProMiner: Recognition of Human Gene and Protein Names using regularly updated Dictionaries (Juliane Fluck)
12:00-12:15 GN: Human Gene Normalization by an Integrated Approach including Abbreviation Resolution and Disambiguation (Katrin Fundel)
12:15-12:30 GN: A Hybrid Gene Normalization approach with capability of disambiguation (Heng-hui Liu)
12:30-12:45 GN: Exploring Match Scores to Boost Precision of Gene Normalization (Bo-Hou Yang)
12:45-13:00 GN: Rule-based Gene Normalization with a Statistical and Heuristic Confidence Measure (William Lau)

13:00-14:00 Lunch break

14:00-14:30 Panel discussion 2 Importance of Gene Normalization for 'down-stream' text mining

Session III: Extraction of Biological Annotations
Chair: Alfonso Valencia

14:30-15:00 Invited speaker Annotating molecular interactions in the MINT database (Gianni Cesareni)
15:00-15:30 Invited speaker Enhancing access to the bibliome for genomics with evaluation tasks derived from user information needs: The TREC Genomics Track (Aaron M. Cohen)
15:30-16:00 Invited speaker The application of ontologies in the biological realm (Suzanna Lewis)

16:00-16:15 Coffee break / Registration

Session III a - IAS: Interaction Article Extraction
Chair: Christian Blaschke

16:15-16:35 Introduction The Interaction-Article Sub-Task evaluation (Martin Krallinger)
16:35-16:50 IAS: The BioCreAtIve 2 GN and PPI-IAS Tasks: Approaches and Analysis (Aaron Cohen)
16:50-17:05 IAS: Semi-supervised Learning of Relevant Articles (Mark Stevenson)
17:05-17:20 IAS: ProtIR Prototype: Finding Relevant Abstracts for Protein-Protein Interaction in the BioCreAtIvE2 Challenge (Yan Hua Chen)
17:20-17:35 IAS: A Term Investigation and Majority Voting for Protein Interaction Article Sub-task 1 (IAS) (Lan Man)
17:35-17:50 IAS: Identifying Protein-Protein Interaction Sentences Using Boosting and Kernel Methods (Sun Kim)
17:50-18:20 Invited speaker IntAct - Serving the text-mining community with high quality molecular interaction data (Samuel Kerrien)

Wednesday - April 25

8:45-9:00 Registration

Session III b - IPS: Interaction-Pair and Interaction Method Extraction
Chair: Carlos Rodriguez

9:00-9:40 Introduction The Interaction-Pair and Interaction Method Sub-Task evaluation (Martin Krallinger)
9:40-9:55 IPS: OntoGene in Biocreative II (Fabio Rinaldi)
9:55-10:10 IPS: GeneTeam Site Report for BioCreative II: Customizing a Simple Toolkit for Text Mining in Molecular Biology (Patrick Ruch)
10:10-10:25 IPS: AKANE System: Protein-Protein Interaction Extraction in the BioCreAtIvE2 Challenge (Rune Saetre)

10:25-10:40 Coffee break / Registration

10:40-10:55 IPS: Consensus pattern alignment to find protein-protein interactions in text (Jörg Hakenberg)
10:55-11:10 IPS: Using predication for identifying Protein-Protein interactions in Biomedical publications (Alejandro Figueroa, Günter Neumann)
11:10-11:25 IPS: Integrating knowledge extracted from biomedical literature: normalization and evidence statements for interactions (Graciela Gonzalez)
11:25-11:40 IPS: Mining physical protein-protein interaction by exploiting abundant features (Minlie Huang)
11:40-11:55 IPS: Uncovering Protein-Protein Interactions in the Bibliome (Luis Rocha)
11:55-12:25 Invited Speaker (Matthew Day)

12:25-13:25 Lunch break

13:25-14:00 Poster session 2

Session III c - ISS: Interaction-Sentence Extraction
Chair: Patrick Ruch

14:00-14:30 Introduction The Interaction-Sentence Sub-Task evaluation (Martin Krallinger)
14:30-14:45 ISS: An integrated approach to concept recognition in biomedical text (Larry Hunter)
14:45-15:00 ISS: Adapting a Relation Extraction Pipeline for the BioCreAtIvE II Tasks (Barry Haddow)

15:00-15:30 Coffee break / Registration

15:30-15:45 ISS: Extracting Interacting Protein Pairs and Evidence Sentences by using Dependency Parsing and Machine Learning Techniques (Arzucan Ozgur)
15:45-16:00 ISS: Protein Interaction Sentence Identification by Using Hierarchical Template-Based Approach (Heng-hui Liu)
16:00-16:30 Invited speaker OregAnno database and RegCreative (Casey Bergman)
16:30-17:30 Panel discussion 3 Text mining and biological annotations

Closing session

17:30-18:30 The future of text mining and information extraction challenge evaluations in the biomedical domain (Lynette Hirschman, Alfonso Valencia, John Wilbur)

(GM: Gene Mention task, GN: Gene Normalization Task, IAS: Interaction Article Subtask, IPS: Interaction Pairs Sub-task, IMS: Interaction Method Sub-task, ISS: Interaction Sentence Sub-task)

Speakers

Alfonso Valencia
John Wilbur
Mariana Lara Neves
Richard Tsai
Craig A. Struble
Andreas Vlachos
Roman Klinger
Yifei Chen
Sophia Katrenko
Rie Kubota Ando
Cheng-Ju Kuo
Chun-Nan Hsu
Francisco Carrero
Kuzman Ganchev
Junichi Tsujii
Christian Blaschke
Martijn Schuemie
Hongfang Liu
Manabu Torii
Jörg Hakenberg
ThaiBinh Luong
Juliane Fluck
Katrin Fundel
Heng-hui Liu
Bo-Hou Yang
William Lau
Gianni Cesareni
Martin Krallinger
Aaron Cohen
Mark Stevenson
Yan Hua Chen
Lan Man
Sun Kim
Samuel Kerrien
Fabio Rinaldi
Patrick Ruch
Rune Saetre
Alejandro Figueroa
Günter Neumann
Graciela Gonzalez
Minlie Huang
Luis Rocha
Matthew Day
Larry Hunter
Barry Haddow
Arzucan Ozgur
Casey Bergman
Lynette Hirschman