Information Extraction in Molecular Biology

Report
Scientific content

Assessment of the results
Abstracts
A
Genomics, informatics, and society
Gert-Jan B. van Ommen
 
B
A terminology management workbench for molecular biology
Sophia Ananiadou, Hideki Mima, and Goran Nenadic
 
C
Mining for Gene Nomenclature
Wain HM, Bruford EA, Lovering RC, Lush M, Wright MW, Povey S.
 
D
Biosemantics. Towards computer-assisted extraction and processing of biological information
B. Mons
 
E
Information extraction from biomedical texts
Jerry R. Hobbs
 
F
"Deep" information extraction from biomedical documents
Udo Hahn, Stefan Schulz, and Martin Romacker
 
G
Semantic induction with EMILE
Pieter Adriaans
 
H
Ontology-driven information extraction
Uwe Reyle and Jasmin Šaric
 
I
Information extraction on health-related discussion forums
Stefan Geißler and Ramon Wartala
 
J
Protein functional classification by text data-mining
Ben Stapley
 
K
Biological function and DNA expression arrays
Christian Blaschke, Luis Cornide, Juan Carlos Oliveros, and Alfonso Valencia
 
L
A suite of tools to mine scientific abstracts for nuclear receptors
Dietrich Schuhmann
 
M
Raw data, funding, and knowledge
Gert Vriend
List of participants

Report ... (draft)

Scientific content

Full or partial genome sequences have become available for many species, and new sequences are published at a fast rate. The new challenge is functional genomics, the identification of gene functions. Molecular biologists realise that functional genomics requires integrating information from many sources. One of these sources is the existing literature, which is under-used. The sheer volume makes processing by hand impossible. Now that many articles are becoming available in digital form, various groups have begun to develop programs able to extract information from natural-language texts.

This workshop aims to promote information extraction efforts by bringing together researchers from molecular biology, bio-informatics, natural-language processing, and computer science. The workshop addresses technical issues (such as statistical versus declarative approaches) and issues on the level of the task definition (such as the trade-off between investments and results). Information extraction is studied in diverse disciplines, and we believe that practitioners can strengthen their own efforts by exchanging ideas and approaches with practitioners in other disciplines. The emphasis of the workshop is on informal exchange of ideas catalysed by talks. We hope the workshop will result in the participants' becoming aware of what other disciplines have to offer, and in multidisciplinary collaborations.

Assessment of the results

The workshop aimed primarily at being a meeting point for researchers who would otherwise not know of each other's existence. We thus hope to contribute to strengthening what already is a highly interdisciplinary field. In this respect, the workshop was a great success. The organisers have received many email messages from participants who have established new and fruitful contacts. By publishing a volume of Proceedings (not funded by ESF), the organisers have also given all participants a state-of-the-art review of many of the major trends in information extraction technology, especially as applied to texts in molecular biology. The Proceedings thus form part of the scientific results. Not all speakers have delivered a paper. As an appendix to the present report, we have included abstracts of all talks save one (the person in question has not sent us an abstract in spite of repeated requests). Below, when we refer to a workshop talk, we mention only the actual speaker; full lists of authors for these contributions can often be found in the Proceedings.

In sum, it appears that information extraction, although drawing on a tradition in computer science, has not matured to a sufficient degree for real-life use. Different techniques are tried, which is not surprising because "information extraction" is itself not a well-defined term; it turns out to cover many different tasks. On a continuum of tasks, simple term extraction forms one extreme, full understanding of a text the other. Simple term extraction may serve to identify co-occurrence of terms in a single text, where such a co-occurrence often signifies a relation like that between two proteins. Term extraction is not a trivial task because term recognition in texts is difficult (Ananiadou); standardisation efforts like that by HUGO may help here (Wain). When the method is extended by keeping statistics, it is an efficient tool for automatic classification of texts. Co-occurrence identification, however, is too weak to discriminate between the various relations that might hold between the referents of terms; it is even incapable of distinguishing between a text that asserts a relation and one that denies it. Stapley and Blaschke both discussed statistical techniques, Stapley for automatic classification of proteins and Blaschke for finding Medline abstracts related to particular genes. The other extreme, full understanding of a text, is (as Hobbs called it) AI-complete: in other words, it takes solving all problems in artificial intelligence before it is feasible. Full text understanding can therefore not be a goal of a research project. Between these extremes there is a continuum of approaches that differ in the granularity of results they produce and in the number and complexity of the resources they require. SRI International (Hobbs) uses its relatively lightweight FASTUS technique, developed and thoroughly tested in the course of DARPA's Message Understanding Conferences. 
The three contributions by Hahn, Šaric and Schuhmann are more ambitious and draw upon not only natural-language processing but also knowledge representation techniques and ontologies to build systems capable of a finer analysis than FASTUS can.
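The limits of co-occurrence identification noted above can be illustrated with a minimal Python sketch. The term list and the three example sentences are invented for illustration and are not taken from any workshop talk; a real system would need proper term recognition rather than a fixed set of lower-cased strings.

```python
from collections import Counter
from itertools import combinations

# Hypothetical term list (an assumption); real term recognition is much harder.
TERMS = {"p53", "mdm2", "brca1"}

def cooccurrences(abstracts):
    """Count how often each unordered pair of known terms shares an abstract."""
    counts = Counter()
    for text in abstracts:
        # Crude tokenisation: split on whitespace, strip punctuation, lower-case.
        tokens = {tok.strip(".,;()").lower() for tok in text.split()}
        found = sorted(tokens & TERMS)
        counts.update(combinations(found, 2))
    return counts

# Invented example sentences: one asserts a relation, one denies it.
abstracts = [
    "MDM2 binds p53 in vitro.",
    "MDM2 does not bind p53 under these conditions.",
    "BRCA1 expression was measured.",
]
pair_counts = cooccurrences(abstracts)
print(pair_counts)
```

Both the asserting and the denying sentence contribute equally to the count for the pair, which is exactly the weakness described: the counts say that two terms are related in some way, not how.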

One of the more important lessons that transpired from this meeting, forcefully worded by Vriend, is that every information extraction project presupposes a well-defined question. The question determines what kind of approach should be used, which resources are needed, and, perhaps most importantly, whether the project is going in the right direction or not. It turned out that quite a few workers in the field undertake information extraction projects without being sufficiently aware of what such a system should deliver.

The organisers expect that, for some time to come, the field will be dominated by groups trying out all sorts of approaches. Research will by necessity be highly interdisciplinary. The present workshop aimed to facilitate interdisciplinary collaboration by bringing together practitioners from different fields. From the reactions of participants we can conclude that this aim has been achieved to everyone's satisfaction. The initial learning period will eventually, in the mid-term, result in a relatively stable set of techniques with well-understood properties and requirements. The period is also required for building resources needed by more ambitious approaches, such as ontologies and knowledge bases with background knowledge. Since building such resources is very expensive, only collaborative efforts aimed at sharable resources are affordable. Against this, preparing a resource for sharing introduces additional difficulties. Because ontologies are also helpful in other fields, like that of database sharing and data integration, the primary focus should be on ontology development.

Abstracts

A. Genomics, informatics, and society
Prof.dr. Gert-Jan B. van Ommen
Department of Human and Clinical Genetics
Leiden University Medical Center
senior Vice-President and past president of HUGO - Human Genome Organization

The last decade has seen many great successes of linkage analysis and positional cloning, now scaled up in the Human Genome Project. Paradigm changes in medical genetics have brought major progress in unravelling the aetiology of most of the major genetic diseases (e.g. Duchenne Muscular Dystrophy, cystic fibrosis, Huntington disease, myotonic dystrophy, X-linked mental retardation) and hereditary cancer syndromes (e.g. retinoblastoma, neurofibromatosis, colon, skin and breast cancer). The genome project has enormously stimulated the development of advanced technology to characterize DNA and study genes. Consequently a spectacular rise has occurred in the identification of causes of genetic disease. Nearly all important, frequent diseases (about 100-150), and a large number of rarer diseases have been traced back to their causally defective gene. In most cases (in total for ca. 1000-1500 genes) also the underlying mutations have been identified. The first, highly valuable spin-off has been the development of specific and reliable diagnostics, thus relieving insecurity and long and burdening diagnostic odysseys. A welcome development first and foremost for patients and their relatives, but also for their medical caretakers.

While often reviled as boring routine by the classical cell- and molecular biologist favouring detailed functional study, in reality the tale of the isolation of major disease genes has been one of original strategies and resourceful tinkering, of large-scale collaboration and competition, rewarded by the finding of entirely novel disease mechanisms and inheritance modes. The study of human genetic disease and animal model systems has highlighted the existence of several novel genetic mechanisms, the impact of which could never have been conceived otherwise, like genetic imprinting, germline mosaicism, trinucleotide repeat expansion and anticipation. In turn, the study of these processes has greatly deepened our fundamental insights in genetics. These worldwide advances have yielded a better understanding of the chain of events connecting the molecular defects in genes, via the functional disturbances in cells and organs, to the clinical effects on the organism as a whole. This so-called 'genotype-phenotype correlation' is of paramount importance, not only in optimal patient and family counselling, but also to define proper groups for the evaluation of strategies for therapy and prevention, especially when more experimental, pharmacological and gene therapies will come within reach. Unfortunately, many of the processes determining the genotype-phenotype correlation are still elusive, as they depend on more complex interactions between multiple genes, different variant alleles of these genes and, last but not least, between genes, gene variants and the environment.

Functional genomics. The combination of genomic information with large-scale miniaturization and automation will bring great advances in insights. Even today, the increasing power of bioinformatics (databases, image processing and Internet use), nanotechnology (the DNA-chip) and automation (laboratory robotics) is putting us on the eve of a quantum leap in understanding of functional networks in living organisms, via an unprecedented scaling-up of information gathering, processing and interpretation. The systematic description of the data in genome projects of man and other organisms is just the first essential step on a long road. The descriptive stages of the genome project are currently being complemented by high-throughput technology to discover the functions and interactions of the newly found genes. The combined results of cross-comparison of the data between different genes of one organism and between the genes and genomes of different organisms, the development of targeted animal models for human disease and the large-scale parallel analysis of gene-expression profiles of tissues in normal versus diseased state and during growth and development, will fundamentally improve our capacities at complex pattern recognition. In scientific research, this is the ever-recurrent basis for new, verifiable hypotheses. Predictably therefore, the quest for the biological functions and interactions of our genes and the subsequent development of medical interventions, be it via gene therapy, or probably more often through pharmacological routes, uncovered by our improved understanding, will undergo an unparalleled blossoming time. This will apply equally to 'classical' genetic diseases and to common, multifactorial disorders like cardiovascular disease, cancer, hypertension, rheumatoid arthritis, migraine, Parkinson's and Alzheimer's disease and even to infectious diseases and injury hazards.

Pharmacogenomics. The aim of the human genome diversity (HGD) project, a population-based offspring of the genome project, is the elucidation of the individual variation of genes. Recently, a section of this variation has attracted major attention of the biotech/pharma industry. The study of genetic factors governing drug response, a field dubbed 'pharmacogenomics', is widely expected to hold tremendous promise for better targeted pharmacological treatment. Currently ill-understood differences in efficacy and side-effects of medicine between different persons are most likely based on genetic differences in drug uptake and metabolism. It is easy to see how the elucidation of these differences could unlock major possibilities for more effective, individually-tailored medical treatments. This promises to greatly reduce health care cost due to ineffective or even disadvantageous drug treatments.

HUGO. Early in the genome research era, scientists worldwide recognized that a free international competition, while healthy as a mechanism to stimulate rapid progress, could potentially cause important wasting of abilities and resources. In contrast, a basic level of coordination and international collaboration, when properly monitored, was envisaged to greatly increase efficiency and assist in generating resources and defining problem areas requiring special attention. To stimulate this coordination, the Human Genome Organization (HUGO) was conceived in 1988 and founded in 1989. In line with the state of research, the major emphasis of HUGO has initially been on guiding the mapping, which previously had proceeded under an even looser, self-imposed scientific regime, the Human Gene Mapping (HGM) workshops, organized biannually by the gene mapping community. HUGO monitors and assists the genome mapping process on one hand by organizing annual Human Genome Mapping Meetings, and on the other hand by a program of smaller meetings and workshops on specific topics, including, for example, the program of single-chromosome workshops, supported by the European Community and NIH and DOE in America.

Mapping, sequencing or both? Until recently, the emphasis of genome mapping has been on the detailed mapping of genes on a chromosome-by-chromosome basis. However, with the current methodological advances, novel whole-genome high-throughput sequencing methods have been established. By these methods, currently about 30,000 human expressed genes have been identified and mapped with great precision to specific subregional locations in the working draft of the human genome. This working draft was announced on June 26, 2000. For the high-quality finishing stage of whole genome sequencing, the debate is ongoing on whether refined regional approaches by expert laboratories are still required or, alternatively, whether sequence-based techniques have become powerful enough to abolish the need for scrupulous mapping altogether. The popularity, in the medically oriented community, of the genome database GDB, established in 1989 and one of the world's mapping databases par excellence, underscores that there is a need for a well-curated, comprehensive phenotype-oriented mapping and annotation database. This will greatly assist functional genomics by providing not only gene maps but also a reference of freely accessible mapping and application data for clinical research, an indispensable condition for the fruitful translation of genome knowledge into practical solutions. This is true both for the mapping of major common diseases and for direct clinical applications in (differential) diagnosis.

Intellectual Property. In the wake of the rapid advances in discovering new genes, a fierce debate has emerged on public-versus-private aspects of our genome heritage. Especially in the field of the analysis of human cDNA/gene sequences and their comparison with other species to unravel function, major issues are still unresolved on how to strike the balance between, on one hand, maximal scientific progress and public benefit - typically served by immediate public access of newly generated data - and proper patent protection of intellectual property of inventions on the other hand, required to safeguard the staggering investments to develop therapies. The existence of an independent international organization like HUGO, which does not report directly to specific governments, industries or funding bodies, is an important asset to an unbiased international discussion. In 1992, 1996, 1997 and 2000, HUGO has generated policy papers on public access, patenting and related intellectual property issues including single-nucleotide polymorphisms (SNPs) and the effect of the European Directive on patenting biologicals (http://www.hugo-international.org).

Genetic services and education. On the side of caution, major points have yet to be addressed to convert the new opportunities of genetic services into beneficial healthcare: First, the provision of requested information, which may be very burdensome to the applicant, needs to be properly embedded in expert clinical-genetic healthcare and preceded as well as guided by well-designed, understandable information. This requires additional research into the impact of genetic information and expansion of the professional field, to deal with increased possibilities and demand. The future implementation of screening programs for major genetic diseases, to widen the access of the public to voluntary preventive and therapeutic options, including life-style choices, increases the need to politically address the level of professional care provision. Furthermore, also the public needs to be better educated into the value, impact and limitations of genetics, especially adolescents as consumers of these services in the immediate future.

Global ethics. An entirely different, but equally important question is whether our society is ready to assimilate these profound changes. One should not overestimate the adaptive capacity of societies. Not even regionally, let alone worldwide. Nationally, and more in general in Western societies, the threat looms of privacy infringements, unequal access to health care and selective in- or exclusion from insurances or labour. Clearly, it would defeat the purpose of genetics research, when exactly those who should benefit most from the developments are put in jeopardy by them. The threat of social inequity also exists on a global scale: There is a great diversity in cultures, social priorities and economic strength between Europe, Asia and America. Even greater health and wealth differences exist between the North and the South. Thus, the establishment of ethically and morally acceptable standards has to be approached with great caution, involving many more parties than only the western scientists, industry and policy makers. This is yet another field where an independent, worldwide organization of knowledgeable professionals of genomics and genetics background has an invaluable contribution, next to regional genetics societies like the American, European and Australasian Societies for Human Genetics and other world agencies like UNESCO. In 1996, the HUGO Ethics committee published a 'Statement on the Principled Conduct of Genetics Research', which has been widely acclaimed. This is the basis of further refinements applicable to specific situations. The first specific issue has been addressed in the statement "DNA Samples: Control and Access", which was published in March 1998, at HUGO's HGM98 Meeting in Turin. This statement seeks to carefully balance privacy issues versus the value of coded maintenance of sample identity, with a view on future validation of epidemiological and specific studies.
In March 1999, at the Brisbane HUGO Meeting HGM99 the "HUGO Ethics Statement on Cloning" was issued, placing this contentious issue in the current framework of human genetics. Further HUGO statements are on "Benefit sharing" (April 2000) and "Gene Therapy" (April 2001). These statements are typically prepared by the HUGO Ethics committee after thoroughly reviewing 60-80 statements and documents from national and international bodies, private and governmental, and are intended to assist national and supranational policy determination and standardization of ethical review.

Economics and funding. As witnessed by the ongoing fierce debate on public versus private issues, commercial development in different western regions and increased attention for less privileged populations, an international dialogue is in order on how to reap the profits of our insights on a balanced, worldwide scale and how to prevent just another increase of the technological gap. Indeed, most European nations could easily end up on the 'wrong' side of this gap. The increasing hi-tech nature of advanced biomedical research and the recent enormous funding increase in biomedical research in the US and Japan, combined with the comparatively limited investments in fundamental biomedical research in Europe, are about to put a heavy mortgage on the role which Europe may still play in areas of decisive impact on health care and economy in the 21st century. In order to reap major benefits for human health care in diagnostic, therapeutic and preventive medicine and to realize the vast array of business opportunities, a more active stimulation of genome research should be considered a priority task by national ministries of health, education and economy, supranational bodies such as the European Union and funding agencies in healthcare such as the Wellcome Trust, where possible assisted by the biotech and pharmaceutical industrial field.


Relevant websites: www.celera.com; www.dmd.nl; www.ebi.ac.uk; www.gene.ucl.ac.uk; www.geneclinics.com; www.hugo-international.org/hugo;
www.gdb.org; www.infobiogen.fr; www.lgtc.nl; www.ncbi.nlm.nih.gov/dbEST;
www.ncbi.nlm.nih.gov/omim; www.ornl.gov/hgmis; www.orphanet.com


B. A terminology management workbench for molecular biology
Sophia Ananiadou,1 Hideki Mima,2 and Goran Nenadic3

1: Computer Science, University of Salford, United Kingdom
S.Ananiadou@salford.ac.uk
2: Dept. of Information Science, University of Tokyo, Japan
mima@is.s.u-tokyo.ac.jp
3: Computer Science, University of Salford, United Kingdom
G.Nenadic@salford.ac.uk

In this paper we introduce the design of a web-based integrated terminology management workbench, in which information extraction and intelligent information retrieval/database access are combined using term-oriented natural language tools. Our work is placed within the BioPath research project whose overall aim is to link information extraction to expressed sequence data validation. The aim of the tool is to extract terms automatically, to cluster them, and to provide efficient access to heterogeneous biological and genomic databases and collections of texts, all wrapped into a user-friendly workbench enabling users to use a wide range of textual and non-textual resources effortlessly. For the evaluation, automatic term recognition and clustering techniques were applied in the domain of nuclear receptors.


C. Mining for Gene Nomenclature
Wain HM, Bruford EA, Lovering RC, Lush M, Wright MW, Povey S.

HUGO Gene Nomenclature Committee, The Galton Laboratory, Department of Biology, University College London, United Kingdom, nome@galton.ucl.ac.uk

"One gene, one name" is an important concept for accurate communication in genetic research. The HUGO Gene Nomenclature Committee (HGNC) has so far been responsible for naming one third of the genes estimated from the draft of the human genome sequence.

Genew, the Human Gene Nomenclature Database <http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl>, is the only resource that provides data for all human genes which have approved symbols. The data in Genew are highly curated by HGNC editors and gene records can be searched on the Web by symbol or name to directly retrieve information. A number of files are exported from Genew and these provide further data to assist information extraction including lists of known aliases and previous symbols, as well as sequence accession IDs. These are available from <http://www.gene.ucl.ac.uk/nomenclature/code/ftpaccess.html>.

There are a number of potential problems inherent in searching scientific text, so we provide a unique approved gene symbol for each human gene and encourage its use by journals and databases. To this end, we concurrently edit both our Genew database and LocusXRef, a database that feeds directly into LocusLink <http://www.ncbi.nlm.nih.gov/LocusLink/>.
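A minimal Python sketch of how lists of aliases and previous symbols might be used to normalise gene names in text before extraction. The alias table below is a small hand-written illustration and does not reproduce the format or content of the files exported from Genew; the function name `normalise` is likewise an assumption.

```python
import re

# Illustrative alias table (approved symbol -> aliases); NOT the Genew export format.
ALIASES = {
    "TP53": ["p53"],
    "CDKN1A": ["p21", "WAF1", "CIP1"],
}

# Invert the table to alias -> approved symbol, matching case-insensitively.
LOOKUP = {
    alias.lower(): symbol
    for symbol, aliases in ALIASES.items()
    for alias in aliases
}
LOOKUP.update({symbol.lower(): symbol for symbol in ALIASES})

def normalise(text):
    """Replace every known alias in the text by its approved symbol."""
    def swap(match):
        word = match.group(0)
        return LOOKUP.get(word.lower(), word)
    # Each alphanumeric token is looked up; unknown tokens are left unchanged.
    return re.sub(r"[A-Za-z0-9]+", swap, text)

print(normalise("Induction of WAF1 by p53 was observed."))
# prints "Induction of CDKN1A by TP53 was observed."
```

Case-insensitive token lookup is a deliberate simplification: short aliases that coincide with ordinary words or with other abbreviations would be replaced wrongly, which is one of the problems of searching scientific text that a unique approved symbol is meant to reduce.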

The HUGO Gene Nomenclature Committee web page is at URL http://www.gene.ucl.ac.uk/nomenclature/. The work of the HGNC is supported by NIH contract N01-LM-9-3533 (60%) and by the UK Medical Research Council (40%).


D. Biosemantics. Towards computer-assisted extraction and processing of biological information
B. Mons

Institute of Medical Informatics, Erasmus University Rotterdam, Netherlands;
barend.mons@inter.nl.net

The explosion of textual and molecular information available to biologists today is creating on the one hand an unprecedented collective resource for research, but on the other hand has an almost suffocating effect in that it has become entirely impossible to read the relevant literature in anyone's discipline. Also information on complex interactions will be more and more spread over a wide range of individual research papers. It becomes almost impossible to rapidly make the connection between multiple factors influencing biological processes, for example as indicated by a group of seemingly unrelated genes being expressed as a response to an experimental stimulus.

The rapid expansion of databases containing information on molecules and their interaction adds another layer of complexity to the information we have to process.

Rather than re-educating all molecular biologists to become computational biologists we must be pro-active in the set up of respectful collaboration between molecular biologists, geneticists and computational experts working on molecular and textual analysis.

As cellular and molecular biology essentially address molecules, their aggregation into organelles and their interactions, the knowledge about relations between molecules grows exponentially relative to the discovery of the molecules themselves. Methodologies to assist in information retrieval and interpretation are still in their infancy and those that have been published are experimental tools and far from being stable products for generic use.

The new group on BIOSEMANTICS, as recently established at the Erasmus University in Rotterdam, will focus on the development of tools to assist interactive knowledge mining in molecular databases and in full text literature. The systems will be partly based on Collexis® technology, which is able to abstract concepts (including molecular concepts) from vast amounts of textual information at a speed that for the first time makes this process interactive with the user. By adding the (biological) relationship between the concepts, we will create "conceptual semantic networks", which will allow rapid analysis of overlapping biochemical pathways and molecular relationships with unprecedented speed. With major databases and publishers, arrangements have been made already to build this system with the highest quality partner network available today.
The seminar will give a preview on the systems to be designed and the discussion will focus on potential collaboration between Leiden and Rotterdam in this field.

E. Information extraction from biomedical texts
Jerry R. Hobbs

Artificial Intelligence Center, SRI International, Menlo Park, California, USA;
hobbs@ai.sri.com

Information extraction is the process of scanning text for information relevant to some interest, including extracting entities, relations, and events. It requires deeper analysis than key word searches, but its aims fall short of the very hard and long-term problem of full text understanding. Information extraction represents a midpoint on this spectrum, where the aim is to capture structured information without sacrificing feasibility.

One of the key ideas in this technology is to separate processing into several stages, in cascaded finite-state transducers. The earlier stages recognize smaller linguistic objects and work in a largely domain-independent fashion. The later stages take these linguistic objects as input and find domain-dependent patterns among them.
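The staged design described above can be sketched in Python as follows. This is only an illustration of the cascade idea: regular expressions stand in for finite-state transducers, the name pattern and the verb list are invented assumptions, and nothing here reproduces the actual SRI system.

```python
import re

# Stage 1 (largely domain-independent in spirit): mark candidate names.
# The pattern "capital letter, letters, then digits" is a crude assumption.
NAME = re.compile(r"\b[A-Z][A-Za-z]*\d+\b")

def tag_names(sentence):
    """Rewrite the sentence with each candidate name wrapped in a tag."""
    return NAME.sub(lambda m: f"<NAME:{m.group(0)}>", sentence)

# Stage 2 (domain-dependent): find an interaction pattern over stage-1 output.
# The verb list is hypothetical.
INTERACTION = re.compile(r"<NAME:(\w+)> (binds|activates|inhibits) <NAME:(\w+)>")

def extract(sentence):
    """Run the two stages in sequence and return (agent, verb, target) triples."""
    tagged = tag_names(sentence)
    return [(a, verb, b) for a, verb, b in INTERACTION.findall(tagged)]

print(extract("We show that MDM2 binds TP53 in these cells."))
# prints [('MDM2', 'binds', 'TP53')]
```

The point of the separation is that the second stage never looks at raw text: it only sees the objects recognised by the first stage, so the domain-specific patterns can stay short while the earlier stage is reused across domains.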

There are now initial efforts to apply this technology to biomedical text. In other domains, the technology plateaued at about 60% recall and precision. Even if applications to biomedical text do no better than this, they could still prove to be of immense help to curatorial activities.
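For readers less familiar with the two measures, a minimal Python sketch of how recall and precision are computed against a hand-annotated gold standard. The item labels are placeholders chosen so that both measures come out at 60%; they are not evaluation data.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted items against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard items that were found.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Placeholder items: three of five extracted facts are right,
# and three of five true facts are found.
gold = {"a", "b", "c", "d", "e"}
extracted = {"a", "b", "c", "x", "y"}
p, r = precision_recall(extracted, gold)
print(p, r)  # prints 0.6 0.6
```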


F. "Deep" information extraction from biomedical documents
Udo Hahn,1 Stefan Schulz,1,2 and Martin Romacker2

1: Text Knowledge Engineering Lab, Freiburg University, Freiburg, Germany
2: Department of Medical Informatics, Freiburg University Hospital, Freiburg, Germany

MEDSYNDIKATE is a natural-language processor for automatically harvesting knowledge from medical finding reports. The content of these documents is transferred to formal representation structures which constitute a corresponding text knowledge base. The system architecture integrates requirements from the analysis of single sentences, as well as those of referentially linked sentences forming cohesive texts. The strong demands MEDSYNDIKATE poses to the availability of expressive knowledge sources are accounted for by two alternative approaches to (semi)automatic knowledge engineering. We also present data for the knowledge extraction performance of MEDSYNDIKATE for three major syntactic patterns in medical documents.


G. Semantic induction with EMILE
Pieter Adriaans

Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, Netherlands; pieter.adriaans@ps.net

The production of textual information in the biomedical domain is very large. More than 400,000 papers are added to Medline per year. Recently researchers have started to work on the implementation and use of text mining tools to extract knowledge from these huge data sets. For the larger part these tools employ more traditional text mining techniques like frequency counts, n-gram search etc. This paper briefly describes the possibilities of the use of more advanced grammar induction tools for semantic learning. We introduce the EMILE grammar induction tool. Possibilities to use the tool for semantic learning in specific domains, in particular bioinformatics, are discussed and illustrated with some results. The value of grammar induction tools over standard text mining solutions lies in their more exhaustive structural analysis of the text. The EMILE tool can be downloaded for research purposes at: http://turing.wins.uva.nl/~pietera/Emile/
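As a point of comparison, the traditional frequency-count and n-gram techniques mentioned above amount to something like the following Python sketch. It illustrates only that baseline, not EMILE or grammar induction, and the example sentence is invented.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count word n-grams in a text after lower-casing and whitespace splitting."""
    tokens = text.lower().split()
    # zip over n shifted copies of the token list yields consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Invented example sentence.
counts = ngram_counts("the receptor binds the ligand and the receptor dimerises")
print(counts.most_common(1))
# prints [('the receptor', 2)]
```

Counts of this kind record which word sequences recur but say nothing about the syntactic role of each word, which is the structural information a grammar induction tool tries to recover.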


H. Ontology-driven information extraction
Uwe Reyle1 and Jasmin Šaric2

1: Institute for Computational Linguistics, University of Stuttgart, Stuttgart, Germany; Uwe.Reyle@ims.uni-stuttgart.de
2: European Media Laboratory, Heidelberg, Germany; Jasmin.Saric@eml.villa-bosch.de

We describe the linguistic components of a system (GenIE) that automatically extracts information about biochemical pathways from free-text sources. We show that the extraction of information must be based on specialized lexica, semantic representations and ontologies. Furthermore a case study is presented that motivates the use of deductive schemata and a systematic relationship between low-level and high-level representations for the interleaving of shallow and deep processing.


I. Information extraction on health-related discussion forums
Stefan Geißler and Ramon Wartala

TEMIS Deutschland; stefan.geissler@temis-group.com; ramon.wartala@temis-group.com


Temis is a software and consulting company specializing in products and solutions in the field of Text Mining. To analyse huge text corpora, Temis offers solutions for categorization (Insight Discoverer™ Categorizer), clustering (Insight Discoverer™ Clusterer), information extraction (Insight Discoverer™ Extractor) and presentation (OnlineMiner™). Our information extraction demo on health-related news groups demonstrates and explains the main architecture and functionalities. On the linguistic side, the information extraction process uses pluggable modules for industry-, domain- and language-specific knowledge (called SkillCartridges™). The extraction technology itself is the main component and is based on cascaded finite-state transducers. The extractor currently supports seven European languages (Russian, Chinese and others are in preparation). Temis also offers a clusterer that uses a hierarchical clustering approach efficient enough to cluster several thousand documents online. Finally, the OnlineMiner is a Java-Enterprise-based intranet component for searching, collecting, analysing and clustering huge document corpora; it integrates the above-mentioned components into a comprehensive application. Using these three products, we analysed many medical and health-related news groups from USENET. Text corpora of this sort are very heterogeneous, and the main question is: can Text Mining help us to find interesting patterns of relations between drugs and diseases? Another important question is: which drugs are mentioned most often in which subgroup of USENET users? To answer these questions, we encoded domain-specific knowledge in dedicated SkillCartridges based on official drug lists and disease thesauri. The extraction process produces trees with concepts for all the coded knowledge. Together with the plain text resources, we store the results in a database management system (DBMS).
We can then access the clustered results with the OnlineMiner on the server side and a normal Web browser on the client. The OnlineMiner offers different views of the collected data. The starting point for all exploration is an interface similar to that of a normal Internet search engine. To get a more intuitive impression of the relevance of the various concepts, we can show a cluster graph or a pie or bar chart. One result our demo system delivers is a clustering of the news group misc.health.diabetes that produces a chart with the 'big four' diabetes medicines. This set-up makes it possible to aggregate information from larger text collections and to produce reports for detailed questions in a novel way.
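
The question of which drugs are mentioned most often in which news group can be illustrated with a small sketch. It is not the Temis extractor: instead of cascaded finite-state transducers and SkillCartridges it uses plain dictionary lookup, and the drug lexicon, the posts and the function name count_drug_mentions are hypothetical stand-ins chosen for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini-lexicon: surface form -> canonical drug concept.
# A real system would derive this from official drug lists.
DRUG_LEXICON = {
    "metformin": "metformin",
    "glucophage": "metformin",
    "insulin": "insulin",
    "glipizide": "glipizide",
}

# Hypothetical posts as (news group, text) pairs.
POSTS = [
    ("misc.health.diabetes", "My doctor switched me from Glucophage to insulin."),
    ("misc.health.diabetes", "Metformin works well for me."),
    ("sci.med", "Is insulin resistance related to diet?"),
]


def count_drug_mentions(posts):
    """Count canonical drug concepts per news group by dictionary lookup."""
    counts = defaultdict(Counter)
    for group, text in posts:
        for token in re.findall(r"[a-z]+", text.lower()):
            concept = DRUG_LEXICON.get(token)
            if concept is not None:
                counts[group][concept] += 1
    return counts


if __name__ == "__main__":
    for group, counter in sorted(count_drug_mentions(POSTS).items()):
        print(group, counter.most_common())
```

The per-group counters are the kind of aggregate that could feed a pie or bar chart; relations between drugs and diseases would need a second lexicon and a co-occurrence count on top of this.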


J. Protein functional classification by text data-mining
Ben Stapley

Biomolecular Sciences, UMIST, Manchester, United Kingdom; b.stapley@icrf.icnet.uk

Classifying proteins into classes based on their cellular role or function is a powerful way of making sense of genomic data. Here we present a method for the automatic classification of proteins by analysis of relevant Medline abstracts. We employ keyword matching using a thesaurus of S. cerevisiae gene naming terms to retrieve relevant text for each protein. From this text, term vectors are generated for each protein. We then train support vector machines to partition the term space automatically and thus to discriminate the textual features that define a protein's function. We test the method on the task of assigning proteins to various subcellular locations, benchmarking it on a set of proteins of known subcellular location. Neither prior knowledge of the problem domain nor any natural language processing is used at any stage. Our method has comparable performance to rule-based text classifiers, and we find that amino acid compositional information is a poor predictor when employed in isolation. Combining text with protein amino acid composition improves recall. We discuss the generality of the method and its potential application to a variety of biological classification problems.
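
The pipeline of term vectors followed by a support vector machine can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: the abstracts are assumed to have been retrieved already, the protein names, texts and labels are invented toy data, and the classifier is a plain linear SVM trained by subgradient descent on the hinge loss rather than whatever SVM package was actually used.

```python
from collections import Counter

# Hypothetical toy data: abstracts already retrieved for each protein.
# Labels: +1 = nuclear, -1 = mitochondrial.
ABSTRACTS = {
    "PROT_A": ["localises to the nucleus and binds chromatin",
               "nucleus transcription factor"],
    "PROT_B": ["chromatin remodelling in the nucleus"],
    "PROT_C": ["mitochondrial membrane respiration complex"],
    "PROT_D": ["respiration in the mitochondrial matrix"],
}
LABELS = {"PROT_A": 1, "PROT_B": 1, "PROT_C": -1, "PROT_D": -1}

# The term space: every word seen in the training abstracts.
VOCABULARY = sorted({word for texts in ABSTRACTS.values()
                     for text in texts for word in text.lower().split()})


def term_vector(texts, vocabulary=VOCABULARY):
    """Word counts over a fixed vocabulary for one protein's abstracts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def score(weights, bias, x):
    """Signed distance-like score of vector x from the separating hyperplane."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_linear_svm(vectors, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Linear SVM fitted by subgradient descent on the L2-regularised hinge loss."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            inside_margin = y * score(weights, bias, x) < 1
            # The regulariser shrinks the weights on every step.
            weights = [w * (1 - learning_rate * reg) for w in weights]
            # The hinge term only contributes for examples inside the margin.
            if inside_margin:
                weights = [w + learning_rate * y * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if score(weights, bias, x) >= 0 else -1


PROTEINS = sorted(ABSTRACTS)
WEIGHTS, BIAS = train_linear_svm([term_vector(ABSTRACTS[p]) for p in PROTEINS],
                                 [LABELS[p] for p in PROTEINS])

if __name__ == "__main__":
    unseen = ["found in the nucleus with chromatin"]
    print(predict(WEIGHTS, BIAS, term_vector(unseen)))
```

Words outside the training vocabulary are simply ignored when an unseen protein is vectorised, which matches the idea that no linguistic processing is applied. Amino acid composition could be combined with the text by appending extra numeric features to each vector.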

K. Biological function and DNA expression arrays
Christian Blaschke,1 Luis Cornide,2 Juan Carlos Oliveros,3 and Alfonso Valencia4

1: Protein Design Group at the CNB/CSIC, Cantoblanco, Universidad Autonoma, Madrid, Spain; blaschke@cnb.uam.es
2: ALMA Bioinformatics, Tres Cantos, Spain; lcornide@almabioinfo.com
3: Protein Design Group at the CNB/CSIC, Cantoblanco, Universidad Autonoma, Madrid, Spain; oliveros@cnb.uam.es
4: Protein Design Group at the CNB/CSIC, Cantoblanco, Universidad Autonoma, Madrid, Spain; valencia@cnb.uam.es

DNA arrays are one of the types of large-scale experiments that have been developed in recent years. These experiments allow new biological insights but also produce an overwhelming flow of data that has to be digested and analysed properly. We developed an information extraction system (GEISHA) that provides an overview of the literature related to the genes that are implicated in an experiment. It extracts keywords and the most important parts of the related abstracts and reorganizes the information so that a deeper insight into what has already been published can be gained with much less effort. Here we present an overview of the system and the results that were obtained in different studies.


L. A suite of tools to mine scientific abstracts for nuclear receptors
Dietrich Schuhmann

LION Bioscience AG, Heidelberg, Germany; Dietrich.Schuhmann@lionbioscience.com

Bioinformatics has become a strong research area which offers and generates large amounts of data for molecular biologists. Nowadays this includes more and more tools to access data from the literature (semi-)automatically. The consequence is the (semi-)automatic extraction of information from abstracts and full papers concerned with biological topics.

LION Bioscience has developed proprietary software for the extraction of simple facts from sentences. The software is based on cascaded finite-state transducers (FSTs). The technology identifies patterns, e.g. 'in vitro', 'in a ligand-dependent manner'; parts of speech, e.g. verbs, nouns, adjectives; and combinations of both. Such combinations can be used to identify parts of sentences that follow syntactic rules. For example, a determiner followed by an adjective and a noun is called a noun phrase: 'the activated protein' is a noun phrase. Combinations of noun phrases in conjunction with a verb are used to identify relations between objects, e.g. 'Pleiotrophin binds to protein-tyrosine phosphatase zeta', where 'Pleiotrophin' and the combination of terms after the verb 'bind to' each represent a noun phrase. Furthermore, the technology has to condense information by extracting parts of the sentence: e.g. 'Pleiotrophin has been demonstrated to bind to protein-tyrosine phosphatase zeta (PTPzeta) with high affinity' is transformed into 'Pleiotrophin binds to PTPzeta', if this is more suitable for the user (biologist).
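
The cascade described above can be imitated with a short sketch. It is not the proprietary LION software: regular expressions over strings of part-of-speech tags stand in for the finite-state transducers, the toy lexicon and all function names are hypothetical, and only the simple 'noun phrase, verb, noun phrase' pattern from the example sentence is covered.

```python
import re

# Hypothetical toy lexicon mapping words to one-letter tags:
# D = determiner, A = adjective, N = noun, V = verb, P = preposition.
LEXICON = {
    "the": "D", "activated": "A", "protein": "N",
    "pleiotrophin": "N", "protein-tyrosine": "N",
    "phosphatase": "N", "zeta": "N",
    "binds": "V", "to": "P",
}

# Stage 2 pattern: optional determiner, any adjectives, one or more nouns.
NOUN_PHRASE = re.compile(r"D?A*N+")
# Stage 3 pattern: noun phrase (X), verb, optional preposition, noun phrase.
RELATION = re.compile(r"X(V)(P?)X")


def tag(tokens):
    """Stage 1: one tag letter per token ('O' for unknown words)."""
    return "".join(LEXICON.get(token.lower(), "O") for token in tokens)


def chunk_noun_phrases(tokens, tags):
    """Stage 2: merge each noun phrase into a single unit tagged 'X'."""
    units, unit_tags, position = [], [], 0
    for match in NOUN_PHRASE.finditer(tags):
        # Copy tokens between noun phrases unchanged.
        units.extend(tokens[position:match.start()])
        unit_tags.extend(tags[position:match.start()])
        units.append(" ".join(tokens[match.start():match.end()]))
        unit_tags.append("X")
        position = match.end()
    units.extend(tokens[position:])
    unit_tags.extend(tags[position:])
    return units, "".join(unit_tags)


def extract_relations(sentence):
    """Stage 3: return (subject, verb, object) triples found in the sentence."""
    tokens = sentence.rstrip(".").split()
    units, unit_tags = chunk_noun_phrases(tokens, tag(tokens))
    relations = []
    for match in RELATION.finditer(unit_tags):
        subject = units[match.start()]
        verb = " ".join(units[match.start(1):match.end(2)])
        obj = units[match.end() - 1]
        relations.append((subject, verb, obj))
    return relations


if __name__ == "__main__":
    print(extract_relations(
        "Pleiotrophin binds to protein-tyrosine phosphatase zeta."))
```

Each stage consumes the output of the previous one, which is the point of a cascade: the relation pattern never sees individual nouns, only whole noun phrases. Condensing 'has been demonstrated to bind to' into 'binds to' would need further patterns that this sketch does not attempt.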

Different software components have been put in place to build the running system: dictionaries of specific terms, software to identify complex terms, tagging tools, the cascaded FSTs, and viewing tools. The analysis tools can cope with any type of sentence, but this does not resolve the complexity of the information contained in such sentences.

An example of a well-behaved sentence is: 'These results suggest that this peptide region is important for p21 interaction with cyclin E/Cdk2'. It yields 'p21 interacts with cyclin E/Cdk2'. The following sentence is an example where the anticipated outcome is unclear: 'To determine if this region mediates interaction with Ku70 in mammalian cells ...' It is unclear what 'this region' refers to and whether 'in mammalian cells' is an essential detail of the interaction with Ku70.

Any information extraction method has to meet the needs of the user. These needs must be clearly identified and described for several reasons: (1) the perception of the written information differs from user to user, (2) the topics of interest depend on the user group, and (3) standardisation (normalisation) of the information from the written text is as yet an unsolved problem. The text mining group of LION Bioscience has run curation projects in collaboration with the laboratory at LION. The core topic of the curation projects was the identification of interactions between nuclear receptors, cofactors and ligands to control genetic regulation. This information was retrieved from Medline abstracts.

As a result, the group has generated a dictionary which contains about 1050 terms, i.e. names of nuclear receptors, cofactors, ligands and species. Furthermore, the dictionary contains about 11500 synonyms. This representation of terms is used to integrate the results from the information extraction methods into software solutions of LION Bioscience.

The presented work is funded by the German government (Project Nr. 0312385, Bio-Path). Project members are: Sophia Ananiadou, Salford University, Manchester, UK, and Prof. Günthner, Centrum für Informations- und Sprachverarbeitung, LMU Munich, Germany. Furthermore, the presented work is part of the Eureka project entitled Bio-Path. The project partner is ValiGen, Paris, France.

M. Raw data, funding, and knowledge
Gert Vriend

CMBI, University of Nijmegen, Nijmegen, Netherlands; vriend@cmbi.kun.nl

These days bioinformatics is going through a phase in which its major task is managing the data explosion. Granting agencies and industries have funded a technological revolution (very fast sequencing, micro arrays, robotics), but have realized too late that all these techniques generate gigabytes of raw data. At this moment it is clear that lots of funding should have been spent five years ago on bioinformatics, but that did not happen. Today, however, there is plenty of money for bioinformatics and I will try to summarize what is done, can be done and should be done with this money. The enormous amount of data opens up new possibilities for research. One example is the analysis of multiple sequence alignments with thousands of sequences, which allows us to determine the function of almost every residue for the task(s) of the molecule in the organism.

List of participants ...

Pieter Adriaans
University of Amsterdam, the Netherlands
Sophia Ananiadou
University of Salford, United Kingdom
Christian Blaschke
Centro Nacional de Biotecnologia, Spain
Stephan Brock
Metalife AG, Germany
Luis Cornide
Alma Bioinformatics S.L., Spain
Michele Finelli
University of Bologna, Italy
Ulrich Grob
Metalife AG, Germany
Roderic Guigo
GRIM, IMIM-UPF, Spain
Udo Hahn
Freiburg University, Germany
Jerry Hobbs
SRI International, USA
Robert Hoffmann
Centro Nacional de Biotecnologia, Spain
Hendri Hondorp
University of Twente, the Netherlands
Bernard Jacq
Laboratoire de Génétique et Physiologie du Développement, France
Annette Martin
The Babraham Institute, United Kingdom
Jan De Meutter
Ghent University, VIB, Belgium
Barend Mons
Erasmus University Rotterdam, the Netherlands
Ivan Mraz
Inst. of Plant Molecular Biology, Czech Republic
Anton Nijholt
University of Twente, the Netherlands
Gertjan van Ommen
Leiden University Medical Center, the Netherlands
Jan Paces
Institute of Molecular Genetics, Czech Republic
Alberto Pascual Montano
National Center of Biotechnology, Spain
Karel Petrzik
Inst. of Plant Molecular Biology, Czech Republic
Ivan Rossi
University of Bologna, Italy
Jasmin Šaric
European Media Laboratory GmbH, Germany
Dietrich Schuhmann
LION bioscience AG, Germany
Miroslav Sip
Inst. of Plant Molecular Biology, Czech Republic
Ben Stapley
Biomolecular Modelling Lab, United Kingdom
Osman Ugur Sezerman
Sabanci University, Turkey
Alfonso Valencia
Centro Nacional de Biotecnologia, Spain
Paul van der Vet
University of Twente, the Netherlands
Gert Vriend
University of Nijmegen, the Netherlands
Hester Wain
University College London, United Kingdom
Ramon Wartala
TEMIS-group, France
Radek Zíka
Institute of Molecular Genetics, Czech Republic