Information Extraction in Molecular Biology
| Report | Scientific content | Assessment of the results |
| Abstracts | A | Genomics, informatics, and society | Gert-Jan B. van Ommen |
| | B | A terminology management workbench for molecular biology | Sophia Ananiadou, Hideki Mima, and Goran Nenadic |
| | C | Mining for Gene Nomenclature | Wain HM, Bruford EA, Lovering RC, Lush M, Wright MW, Povey S. |
| | D | Biosemantics. Towards computer-assisted extraction and processing of biological information | B. Mons |
| | E | Information extraction from biomedical texts | Jerry R. Hobbs |
| | F | "Deep" information extraction from biomedical documents | Udo Hahn, Stefan Schulz, and Martin Romacker |
| | G | Semantic induction with EMILE | Pieter Adriaans |
| | H | Ontology-driven information extraction | Uwe Reyle and Jasmin Šarić |
| | I | Information extraction on health-related discussion forums | Stefan Geißler and Ramon Wartala |
| | J | Protein functional classification by text data-mining | Ben Stapley |
| | K | Biological function and DNA expression arrays | Christian Blaschke, Luis Cornide, Juan Carlos Oliveros, and Alfonso Valencia |
| | L | A suite of tools to mine scientific abstracts for nuclear receptors | Dietrich Schuhmann |
| | M | Raw data, funding, and knowledge | Gert Vriend |
| List of participants |
Report
... (draft)
Scientific content
Full
or partial genome sequences have become available for many
species, and new sequences are published at a fast rate. The
new challenge is functional genomics, the identification of
gene functions. Molecular biologists realise that functional
genomics requires integrating information from many
sources. One of these sources is the existing literature, which
is under-used. The sheer volume makes processing by hand impossible.
Now that many articles are becoming available in digital form, various
groups have begun to develop programs able to extract information
from natural-language texts.
This
workshop aims to promote information extraction efforts by
bringing together researchers from molecular biology, bio-informatics,
natural-language processing, and computer science. The workshop
addresses technical issues (such as statistical versus declarative
approaches) and issues on the level of the task definition
(such as the trade-off between investments and results). Information
extraction is studied in diverse disciplines, and we believe
that practitioners can strengthen their own efforts by exchanging
ideas and approaches with practitioners in other disciplines.
The emphasis of the workshop is on informal exchange of ideas
catalysed by talks. We hope the workshop will result in the
participants' becoming aware of what other disciplines have
to offer, and in multidisciplinary collaborations.
Assessment of the results
The
workshop aimed primarily at being a meeting point for researchers
who would otherwise not know of each other's existence. We
thus hope to contribute to strengthening what already is a
highly interdisciplinary field. In this respect, the workshop
was a great success. The organisers have received quite a
lot of email messages from participants who have established
new and fruitful contacts. By publishing a volume of Proceedings
(not funded by ESF), the organisers have also given all participants
a state-of-the-art review of many of the major trends in information
extraction technology, especially as applied to texts in molecular
biology. The Proceedings thus form part of the scientific results.
Not all speakers have delivered a paper. As an appendix to
the present report, we have included abstracts of all talks
save one (the person in question has not sent us an abstract
in spite of repeated requests). Below, when we refer to a
workshop talk, we mention only the actual speaker; full lists
of authors for these contributions can often be found in the
Proceedings.
In
sum, it appears that information extraction, although drawing
on a tradition in computer science, has not matured to a sufficient
degree for real-life use. Different techniques are tried,
which is not surprising because "information extraction"
is itself not a well-defined term; it turns out to cover many
different tasks. On a continuum of tasks, simple term extraction
forms one extreme, full understanding of a text the other.
Simple term extraction may serve to identify co-occurrence
of terms in a single text, where such a co-occurrence often
signifies a relation like that between two proteins. Term
extraction is not a trivial task because term recognition
in texts is difficult (Ananiadou); standardisation efforts
like that by HUGO may help here (Wain). When the method is
extended by keeping statistics, it is an efficient tool for
automatic classification of texts. Co-occurrence identification,
however, is too weak to discriminate between the various relations
that might hold between the referents of terms; it is even
incapable of distinguishing between a text that asserts a
relation and one that denies it. Stapley and Blaschke both
discussed statistical techniques, Stapley for automatic classification
of proteins and Blaschke for finding Medline abstracts related
to particular genes. The other extreme, full understanding
of a text, is (as Hobbs called it) AI-complete: in other words,
it becomes feasible only once all problems in artificial intelligence
have been solved. Full text understanding can therefore not
be a goal of a research project. Between these extremes there
is a continuum of approaches that differ in the granularity
of results they produce and in the number and complexity of
the resources they require. SRI International (Hobbs) uses
its relatively lightweight FASTUS technique, developed and
thoroughly tested in the course of DARPA's Message Understanding
Conferences. The three contributions by Hahn, Šarić and
Schuhmann are more ambitious and draw upon not only natural-language
processing but also knowledge representation techniques and
ontologies to build systems capable of a finer analysis than
FASTUS can.
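The co-occurrence approach described above, and its main weakness, can be illustrated with a minimal sketch. It is not taken from any of the systems presented at the workshop: the function name, the naive substring matching, and the placeholder terms protA, protB and protC are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(texts, terms):
    """Count, for each unordered pair of terms, the texts mentioning both."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        # Naive term recognition by substring match (an assumption;
        # real term recognition is much harder, as noted above).
        present = sorted(t for t in terms if t.lower() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical one-sentence "abstracts" with placeholder protein names.
abstracts = [
    "protA binds protB in vitro.",
    "protA does not bind protB.",
    "protC is expressed in liver.",
]
counts = cooccurrence_counts(abstracts, ["protA", "protB", "protC"])
print(counts[("protA", "protB")])
```

Both the asserting and the denying sentence contribute to the count for the pair, which is exactly why co-occurrence alone cannot tell which relation, if any, holds between the two terms.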
One
of the more important lessons that transpired from this meeting,
forcefully worded by Vriend, is that every information extraction
project presupposes a well-defined question. The question
determines what kind of approach should be used, which resources
are needed, and, perhaps most importantly, whether the project
is going in the right direction or not. It turned out that
quite a few workers in the field undertake information extraction
projects without being sufficiently aware of what such a system
should deliver.
The
organisers expect that, for some time to come, the field will
be dominated by groups trying out all sorts of approaches.
Research will by necessity be highly interdisciplinary. The
present workshop aimed to facilitate interdisciplinary collaboration
by bringing together practitioners from different fields.
From the reactions of participants we can conclude that this
aim has been achieved to everyone's satisfaction. The initial
learning period will, in the medium term, result in
a relatively stable set of techniques with well-understood
properties and requirements. The period is also required for
building resources needed by more ambitious approaches, such
as ontologies and knowledge bases with background knowledge.
Since building such resources is very expensive, only collaborative
efforts aimed at sharable resources are affordable. Against
this, preparing a resource for sharing introduces additional
difficulties. Because ontologies are also helpful in other
fields, like that of database sharing and data integration,
the primary focus should be on ontology development.
Abstracts
A. Genomics, informatics, and society
Prof.dr. Gert-Jan B. van Ommen
Department of Human and Clinical Genetics
Leiden University Medical Center
senior Vice-President and past president of HUGO - Human Genome
Organization
The
last decade has seen many great successes of linkage analysis
and positional cloning, now scaled up in the Human Genome
Project. Paradigm changes in medical genetics have brought
major progress in unravelling the aetiology of most of the
major genetic diseases (e.g. Duchenne Muscular Dystrophy,
cystic fibrosis, Huntington disease, myotonic dystrophy, X-linked
mental retardation) and hereditary cancer syndromes (e.g.
retinoblastoma, neurofibromatosis, colon, skin and breast
cancer). The genome project has enormously stimulated the
development of advanced technology to characterize DNA and
study genes. Consequently a spectacular rise has occurred
in the identification of causes of genetic disease. Nearly
all important, frequent diseases (about 100-150), and a large
number of rarer diseases have been traced back to their causally
defective gene. In most cases (for about 1000-1500 genes in
total) the underlying mutations have also been identified.
The first, highly valuable spin-off has been the development
of specific and reliable diagnostics, relieving uncertainty
and sparing patients long and burdensome diagnostic odysseys. This is a
welcome development first and foremost for patients and their relatives, but also
for their medical caretakers.
While
often reviled as boring routine by the classical cell- and
molecular biologist favouring detailed functional study, in
reality the tale of the isolation of major disease genes has
been one of original strategies and resourceful tinkering,
of large-scale collaboration and competition, rewarded by
the finding of entirely novel disease mechanisms and inheritance
modes. The study of human genetic disease and animal model
systems has highlighted the existence of several novel genetic
mechanisms, the impact of which could never have been conceived
otherwise, like genetic imprinting, germline mosaicism, trinucleotide
repeat expansion and anticipation. In turn, the study of these
processes has greatly deepened our fundamental insights into
genetics. These worldwide advances have yielded a better understanding
of the chain of events connecting the molecular defects in
genes, via the functional disturbances in cells and organs,
to the clinical effects on the organism as a whole. This so-called
'genotype-phenotype correlation' is of paramount importance,
not only in optimal patient and family counselling, but also
to define proper groups for the evaluation of strategies for
therapy and prevention, especially when more experimental,
pharmacological and gene therapies come within reach.
Unfortunately, many of the processes determining the genotype-phenotype
correlation are still elusive, as they depend on more complex
interactions between multiple genes, different variant alleles
of these genes and, last but not least, between genes, gene
variants and the environment.
Functional genomics. The combination of genomic information with
large-scale miniaturization and automation will bring great
advances in insight. Even today, the increasing power of
bioinformatics (databases, image processing and Internet use),
nanotechnology (the DNA-chip) and automation (laboratory robotics)
is putting us on the eve of a quantum leap in understanding
of functional networks in living organisms, via an unprecedented
scaling-up of information gathering, processing and interpretation.
The systematic description of the data in genome projects
of man and other organisms is just the first essential step
on a long road. The descriptive stages of the genome project
are currently being complemented by high-throughput technology
to discover the functions and interactions of the newly found
genes. The combined results of cross-comparison of the data
between different genes of one organism and between the genes
and genomes of different organisms, the development of targeted
animal models for human disease and the large-scale parallel
analysis of gene-expression profiles of tissues in normal
versus diseased state and during growth and development, will
fundamentally improve our capacities at complex pattern recognition.
In scientific research, this is the ever-recurrent basis for
new, verifiable hypotheses. Predictably therefore, the quest
for the biological functions and interactions of our genes
and the subsequent development of medical interventions, be
it via gene therapy, or probably more often through pharmacological
routes, uncovered by our improved understanding, will undergo
a period of unparalleled blossoming. This will apply equally to
'classical' genetic diseases and to common, multifactorial
disorders like cardiovascular disease, cancer, hypertension,
rheumatoid arthritis, migraine, Parkinson's and Alzheimer's disease, and
even to infectious diseases and injury hazards.
Pharmacogenomics.
The aim of the human genome diversity (HGD) project, a population-based
offspring of the genome project, is the elucidation of the
individual variation of genes. Recently, a section of this
variation has attracted major attention of the biotech/pharma
industry. The study of genetic factors governing drug response,
a field dubbed 'pharmacogenomics', is widely expected to hold
tremendous promise for better targeted pharmacological treatment.
Currently ill-understood differences in efficacy and side-effects
of medicine between different persons are most likely based
on genetic differences in drug uptake and metabolism. It is
easy to see how the elucidation of these differences could
unlock major possibilities for more effective, individually-tailored
medical treatments. This promises to greatly reduce health
care cost due to ineffective or even disadvantageous drug
treatments.
HUGO. Early in the genome research era, scientists worldwide recognized
that free international competition, while healthy as a
mechanism to stimulate rapid progress, could potentially cause
a substantial waste of talent and resources. In contrast,
a basic level of coordination and international collaboration,
when properly monitored, was envisaged to greatly increase
efficiency and assist in generating resources and defining
problem areas requiring special attention. To stimulate this
coordination, the Human Genome Organization (HUGO) was conceived
in 1988 and founded in 1989. In line with the state of research,
the major emphasis of HUGO was initially on guiding the
mapping, which previously had proceeded under an even looser,
self-imposed scientific regime, the Human Gene Mapping (HGM)
workshops, organized biannually by the gene mapping community.
HUGO monitors and assists the genome mapping process on one
hand by organizing annual Human Genome Mapping Meetings, and
on the other hand by a program of smaller meetings and workshops
on specific topics, including, for example, the program of
single-chromosome workshops, supported by the European Community
and NIH and DOE in America.
Mapping,
sequencing or both? Until recently, the emphasis of genome
mapping has been on the detailed mapping of genes on a chromosome-by-chromosome
basis. However, with the current methodological advances,
novel whole-genome high-throughput sequencing methods have
been established. By these methods, currently about 30,000
human expressed genes have been identified and mapped with
great precision to specific subregional locations in the working
draft of the human genome. This working draft was announced
on June 26, 2000. For the high-quality finishing stage of
whole genome sequencing, the debate is ongoing on whether
refined regional approaches by expert laboratories are still
required or alternatively, if sequence-based techniques have
become powerful enough to abolish the need for scrupulous
mapping altogether. The popularity, in the medically oriented
community, of the genome database GDB, established in 1989
and one of the world's mapping databases par excellence, underscores
that there is a need for a well-curated, comprehensive phenotype-oriented
mapping and annotation database. This will greatly assist
functional genomics by providing not only gene maps but also
a reference of freely accessible mapping and application data
for clinical research, an indispensable condition for the
fruitful translation of genome knowledge into practical solutions.
This is true both for the mapping of major common diseases
and for direct clinical applications in (differential) diagnosis.
Intellectual
Property. In the wake of the rapid advances in discovering
new genes, a fierce debate has emerged on public-versus-private
aspects of our genome heritage. Especially in the field of
the analysis of human cDNA/gene sequences and their comparison
with other species to unravel function, major issues are still
unresolved on how to strike the balance between, on one hand,
maximal scientific progress and public benefit - typically
served by immediate public access of newly generated data
- and proper patent protection of intellectual property of
inventions on the other hand, required to safeguard the staggering
investments to develop therapies. The existence of an independent
international organization like HUGO, which does not report
directly to specific governments, industries or funding bodies,
is an important asset to an unbiased international discussion.
In 1992, 1996, 1997 and 2000, HUGO issued policy papers
on public access, patenting and related intellectual property
issues including single-nucleotide polymorphisms (SNPs) and
the effect of the European Directive on patenting biologicals
(http://www.hugo-international.org).
Genetic
services and education. On the side of caution, major
points have yet to be addressed to convert the new opportunities
of genetic services into beneficial healthcare: First, the
provision of requested information, which may be very burdensome
to the applicant, needs to be properly embedded in expert
clinical-genetic healthcare and preceded as well as guided
by well-designed, understandable information. This requires
additional research into the impact of genetic information
and expansion of the professional field, to deal with increased
possibilities and demand. The future implementation of screening
programs for major genetic diseases, to widen the access of
the public to voluntary preventive and therapeutic options,
including life-style choices, increases the need to politically
address the level of professional care provision. Furthermore,
the public also needs to be better educated about the value,
impact and limitations of genetics, especially adolescents
as consumers of these services in the immediate future.
Global ethics. An entirely different, but equally important question
is whether our society is ready to assimilate these profound
changes. One should not overestimate the adaptive capacity
of societies, not even regionally, let alone worldwide. Nationally,
and more in general in Western societies, the threat looms
of privacy infringements, unequal access to health care and
selective in- or exclusion from insurances or labour. Clearly,
it would defeat the purpose of genetics research, when exactly
those who should benefit most from the developments are put
in jeopardy by them. The threat of social inequity also exists
on a global scale: There is a great diversity in cultures,
social priorities and economic strength between Europe, Asia
and America. Even greater health and wealth differences exist
between the North and the South. Thus, the establishment of
ethically and morally acceptable standards has to be approached
with great caution, involving many more parties than only
the western scientists, industry and policy makers. This is
yet another field where an independent, worldwide organization
of knowledgeable professionals of genomics and genetics background
has an invaluable contribution, next to regional genetics
societies like the American, European and Australasian Societies
for Human Genetics and other world agencies like UNESCO. In
1996, the HUGO Ethics committee published a 'Statement
on the Principled Conduct of Genetics Research', which has
been widely acclaimed. This is the basis of further refinements
applicable to specific situations. The first specific issue
was addressed in the statement "DNA Samples: Control
and Access", which was published in March 1998,
at HUGO's HGM98 Meeting in Turin. This statement seeks to
carefully balance privacy issues versus the value of coded
maintenance of sample identity, with a view on future validation
of epidemiological and specific studies. In March 1999, at
the Brisbane HUGO Meeting HGM99 the "HUGO Ethics Statement
on Cloning" was issued, placing this contentious issue
in the current framework of human genetics. Further HUGO statements
are on "Benefit sharing" (April 2000) and "Gene Therapy"
(April 2001). These statements are typically prepared by the
HUGO Ethics committee after thoroughly reviewing 60-80 statements
and documents from national and international bodies, private
and governmental, and are intended to assist national and
supranational policy determination and standardization of
ethical review.
Economics
and funding. As witnessed by the ongoing fierce debate
on public versus private issues, commercial development in
different western regions and increased attention for less
privileged populations, an international dialogue is in order
on how to reap the profits of our insights on a balanced,
worldwide scale and how to prevent just another increase of
the technological gap. Indeed, most European nations could
easily end up on the 'wrong' side of this gap. The increasing
hi-tech nature of advanced biomedical research and the recent
enormous funding increase in biomedical research in the US
and Japan, combined with the comparatively limited investments
in fundamental biomedical research in Europe, are about to
put a heavy mortgage on the role which Europe may still play
in areas of decisive impact on health care and economy in
the 21st century. In order to reap major benefits for human
health care in diagnostic, therapeutic and preventive medicine
and to realize the vast array of business opportunities, a
more active stimulation of genome research should be considered
a priority task by national ministries of health, education
and economy, supranational bodies such as the European Union
and funding agencies in healthcare such as the Wellcome Trust,
where possible assisted by the biotech and pharmaceutical
industrial field.
Relevant websites: www.celera.com;
www.dmd.nl;
www.ebi.ac.uk;
www.gene.ucl.ac.uk;
www.geneclinics.com;
www.hugo-international.org/hugo;
www.gdb.org;
www.infobiogen.fr;
www.lgtc.nl;
www.ncbi.nlm.nih.gov/dbEST;
www.ncbi.nlm.nih.gov/omim;
www.ornl.gov/hgmis;
www.orphanet.com
B. A terminology management workbench for molecular biology
Sophia Ananiadou,1 Hideki Mima,2 and Goran Nenadic3
1: Computer Science, University of Salford, United Kingdom
S.Ananiadou@salford.ac.uk
2: Dept. of Information Science, University of Tokyo, Japan
mima@is.s.u-tokyo.ac.jp
3: Computer Science, University of Salford, United Kingdom
G.Nenadic@salford.ac.uk
In
this paper we introduce the design of a web-based integrated
terminology management workbench, in which information extraction
and intelligent information retrieval/database access are
combined using term-oriented natural language tools. Our work
is placed within the BioPath research project whose overall
aim is to link information extraction to expressed sequence
data validation. The aim of the tool is to extract terms
automatically, to cluster them, and to provide efficient access to
heterogeneous biological and genomic databases and collections
of texts, all wrapped into a user-friendly workbench enabling
users to use a wide range of textual and non-textual resources
effortlessly. For the evaluation, automatic term recognition
and clustering techniques were applied in the domain of nuclear
receptors.
C. Mining for Gene Nomenclature
Wain HM, Bruford EA, Lovering RC, Lush M, Wright MW, Povey
S.
HUGO
Gene Nomenclature Committee, The Galton Laboratory, Department
of Biology, University College London, United Kingdom, nome@galton.ucl.ac.uk
"One
gene, one name" is an important concept for accurate
communication in genetic research. The HUGO Gene Nomenclature
Committee (HGNC) has so far been responsible for naming one
third of the genes estimated from the draft of the human genome
sequence.
Genew,
the Human Gene Nomenclature Database <http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl>,
is the only resource that provides data for all human genes
which have approved symbols. The data in Genew are highly
curated by HGNC editors and gene records can be searched on
the Web by symbol or name to directly retrieve information.
A number of files are exported from Genew and these provide
further data to assist information extraction including lists
of known aliases and previous symbols, as well as sequence
accession IDs. These are available from <http://www.gene.ucl.ac.uk/nomenclature/code/ftpaccess.html>.
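One way such alias lists can assist information extraction is by mapping gene mentions found in text back to their approved symbols. The sketch below is a hedged illustration only: the in-memory dictionary layout and the function names are assumptions and do not reflect the actual format of the exported Genew files, and the two records are a small illustrative sample.

```python
def build_alias_index(records):
    """Map each approved symbol and each alias (upper-cased) to the approved symbol."""
    index = {}
    for approved, alternatives in records.items():
        index[approved.upper()] = approved
        for alt in alternatives:
            index[alt.upper()] = approved
    return index


def normalise(mention, index):
    """Return the approved symbol for a mention, or None if it is unknown."""
    return index.get(mention.upper())


# Illustrative sample: approved symbol -> known aliases.
# The layout is hypothetical, not the Genew export format.
records = {
    "TP53": ["p53"],
    "ERBB2": ["HER2", "NEU"],
}
index = build_alias_index(records)
print(normalise("p53", index))
print(normalise("her2", index))
```

Case-insensitive lookup is a simplification; in practice an alias may be shared by several genes or coincide with an ordinary word, so a real system would need context to disambiguate.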
There
are a number of potential problems inherent in searching scientific
text, so we provide a unique approved gene symbol for each
human gene and encourage its use by journals and databases.
To this end, we concurrently edit both our Genew database
and LocusXRef, a database that feeds directly into LocusLink
<http://www.ncbi.nlm.nih.gov/LocusLink/>.
The
HUGO Gene Nomenclature Committee web page is at URL http://www.gene.ucl.ac.uk/nomenclature/.
The work of the HGNC is supported by NIH contract N01-LM-9-3533
(60%) and by the UK Medical Research Council (40%).
D. Biosemantics. Towards computer-assisted extraction and processing of biological information
B. Mons
Institute
of Medical Informatics, Erasmus University Rotterdam, Netherlands;
barend.mons@inter.nl.net
The
explosion of textual and molecular information available to
biologists today is creating on the one hand an unprecedented
collective resource for research, but on the other hand has
an almost suffocating effect, in that it has become entirely
impossible to read all the relevant literature in any one discipline.
In addition, information on complex interactions will be more and
more spread over a wide range of individual research papers.
It becomes almost impossible to rapidly make the connection
between multiple factors influencing biological processes,
for example as indicated by a group of seemingly unrelated
genes being expressed as a response to an experimental stimulus.
The
rapid expansion of databases containing information on molecules
and their interaction adds another layer of complexity to
the information we have to process.
Rather than re-educating all molecular biologists to become computational
biologists, we must be pro-active in setting up respectful
collaboration between molecular biologists, geneticists and
computational experts working on molecular and textual analysis.
As
cellular and molecular biology essentially address molecules,
their aggregation into organelles and their interactions,
the knowledge about relations between molecules grows exponentially
relative to the discovery of the molecules themselves. Methodologies
to assist in information retrieval and interpretation are
still in their infancy and those that have been published
are experimental tools and far from being stable products
for generic use.
The
new group on BIOSEMANTICS, as recently established at the
Erasmus University in Rotterdam will focus on the development
of tools to assist interactive knowledge mining in molecular
databases and in full-text literature. The systems will be
partly based on Collexis® technology, which is able to
abstract concepts (including molecular concepts) from vast
amounts of textual information at a speed that for the
first time makes this process interactive with the user.
By adding the (biological) relationship between the concepts,
we will create "conceptual semantic networks", which
will allow rapid analysis of overlapping biochemical pathways
and molecular relationships with unprecedented speed. With
major databases and publishers, arrangements have been made
already to build this system with the highest quality partner
network available today.
The seminar will give a preview on the systems to be designed
and the discussion will focus on potential collaboration between
Leiden and Rotterdam in this field.
E. Information extraction from biomedical texts
Jerry R. Hobbs
Artificial Intelligence Center, SRI International, Menlo
Park, California, USA;
hobbs@ai.sri.com
Information
extraction is the process of scanning text for information
relevant to some interest, including extracting entities,
relations, and events. It requires deeper analysis than key
word searches, but its aims fall short of the very hard and
long-term problem of full text understanding. Information
extraction represents a midpoint on this spectrum, where the
aim is to capture structured information without sacrificing
feasibility.
One
of the key ideas in this technology is to separate processing
into several stages, in cascaded finite-state transducers.
The earlier stages recognize smaller linguistic objects and
work in a largely domain-independent fashion. The later stages
take these linguistic objects as input and find domain-dependent
patterns among them.
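The staged design can be sketched in a few lines. This is a toy illustration under stated assumptions, not the FASTUS grammar: regular expressions stand in for finite-state transducers, the verb lexicon and the crude name pattern are invented for the example, and Raf1 and Mek1 serve only as example entity names.

```python
import re

# Stage 1 patterns: small linguistic objects (both are toy assumptions).
NAME = r"\b[A-Z][A-Za-z]*\d+\b"  # crude stand-in for name recognition
VERB_GROUP = r"\b(?:does not |is not )?(?:binds?|activates?|inhibits?)\b"
CHUNKER = re.compile(f"(?P<NAME>{NAME})|(?P<VG>{VERB_GROUP})")


def stage1(sentence):
    """Early stage: turn raw text into typed chunks, in textual order."""
    return [(m.lastgroup, m.group()) for m in CHUNKER.finditer(sentence)]


def stage2(chunks):
    """Later stage: find the domain pattern NAME VG NAME among the chunks."""
    events = []
    for (t1, agent), (t2, verb), (t3, target) in zip(chunks, chunks[1:], chunks[2:]):
        if (t1, t2, t3) == ("NAME", "VG", "NAME"):
            events.append({
                "agent": agent,
                "relation": verb.split()[-1],
                "target": target,
                "negated": " not " in f" {verb} ",
            })
    return events


def extract(sentence):
    """Run the cascade: the output of stage 1 is the input of stage 2."""
    return stage2(stage1(sentence))


print(extract("Raf1 activates Mek1."))
print(extract("Raf1 does not bind Mek1."))
```

The second stage never looks at raw characters, only at the chunk types produced by the first, so moving to a new domain would mainly mean replacing the pattern in stage2. Unlike plain co-occurrence counting, the verb group also carries the negation.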
There are now initial efforts to apply this technology to biomedical
text. In other domains, the technology plateaued at about
60% recall and precision. Even if applications to biomedical
text do no better than this, they could still prove to be
of immense help to curatorial activities.
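For readers less familiar with the two measures: precision is the fraction of extracted items that are correct, and recall is the fraction of correct items that were extracted. A minimal sketch follows; the function name and the example tuples are hypothetical, chosen so that both measures come out at the 60% level mentioned above.

```python
def precision_recall(extracted, gold):
    """Compute precision and recall of extracted items against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard and system output (relation tuples).
gold = {("A", "binds", "B"), ("A", "binds", "C"), ("B", "binds", "D"),
        ("C", "binds", "D"), ("D", "binds", "E")}
extracted = {("A", "binds", "B"), ("A", "binds", "C"), ("B", "binds", "D"),
             ("A", "binds", "E"), ("B", "binds", "E")}
print(precision_recall(extracted, gold))  # (0.6, 0.6)
```

At this level a curator would still have to check every extracted item and search for the missed ones, but would start from a pre-filled draft rather than from raw text.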
F. "Deep" information extraction from biomedical documents
Udo Hahn,1 Stefan Schulz,1,2
and Martin Romacker2
1:
Text Knowledge Engineering Lab, Freiburg University, Freiburg,
Germany
2: Department of Medical Informatics, Freiburg University
Hospital, Freiburg, Germany
MEDSYNDIKATE
is a natural-language processor for automatically harvesting
knowledge from medical finding reports. The content of these
documents is transferred to formal representation structures
which constitute a corresponding text knowledge base. The
system architecture integrates requirements from the analysis
of single sentences, as well as those of referentially linked
sentences forming cohesive texts. The strong demands MEDSYNDIKATE
places on the availability of expressive knowledge sources
are met by two alternative approaches to (semi)automatic
knowledge engineering. We also present data for the knowledge
extraction performance of MEDSYNDIKATE for three major syntactic
patterns in medical documents.
G. Semantic induction with EMILE
Pieter Adriaans
Institute for Logic, Language and Computation, University of Amsterdam,
Amsterdam, Netherlands; pieter.adriaans@ps.net
The
production of textual information in the biomedical domain
is very large. More than 400,000 papers are added to Medline
per year. Recently researchers have started to work on the
implementation and use of text mining tools to extract knowledge
from these huge data sets. For the larger part these tools
employ more traditional text mining techniques like frequency
counts, n-gram search etc. This paper briefly describes the
possibilities of the use of more advanced grammar induction
tools for semantic learning. We introduce the EMILE grammar
induction tool. Possibilities to use the tool for semantic
learning in specific domains, in particular bioinformatics,
are discussed and illustrated with some results. The value of
grammar induction tools over standard text mining solutions
lies in their more exhaustive structural analysis of the text.
The EMILE tool can be downloaded for research purposes at:
http://turing.wins.uva.nl/~pietera/Emile/
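For contrast, the traditional techniques mentioned above (frequency counts and n-gram search) amount to little more than the following sketch. It is not part of EMILE and does not show grammar induction; the function name, the tokenisation rule, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams of lower-cased alphanumeric tokens in a text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text.
sample = "gene expression data and gene expression profiles"
bigrams = ngram_counts(sample, 2)
print(bigrams[("gene", "expression")])
```

Such counts record which word sequences recur but say nothing about which expressions can substitute for one another in the same context, which is the kind of structural information a grammar induction tool aims to recover.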
H. Ontology-driven information extraction
Uwe Reyle1 and Jasmin Šarić2
1: Institute for Computational Linguistics, University of Stuttgart,
Stuttgart, Germany; Uwe.Reyle@ims.uni-stuttgart.de
2: European Media Laboratory, Heidelberg, Germany;
Jasmin.Saric@eml.villa-bosch.de
We
describe the linguistic components of a system (GenIE) that
automatically extracts information about biochemical pathways
from free-text sources. We show that the extraction of information
must be based on specialized lexica, semantic representations
and ontologies. Furthermore, a case study is presented that
motivates the use of deductive schemata and a systematic relationship
between low-level and high-level representations for the interleaving
of shallow and deep processing.
I.
Information extraction on health-related discussion forums
Stefan Geißler and Ramon Wartala
TEMIS
Deutschland; stefan.geissler@temis-group.com;
ramon.wartala@temis-group.com
Temis is a software and consulting company specializing in
products and solutions in the field of Text Mining. To analyse
huge text corpora, Temis offers solutions for categorization
(Insight Discoverer Categorizer), clustering (Insight
Discoverer Clusterer), information extraction (Insight
Discoverer Extractor) and presentation (OnlineMiner).
Our information extraction demo on health-related news groups
demonstrates and explains the main architecture and functionalities.
On the linguistic side, the information extraction process
uses pluggable modules for industry-, domain- and language-specific
knowledge (called SkillCartridges). The extraction technology
itself is the main component and is based on cascaded finite-state
transducers. The extractor currently supports seven
European languages (Russian, Chinese and others are in preparation).
Also, Temis offers a clusterer that uses a hierarchical clustering
approach efficient enough to cluster several thousands of
documents online. Finally, the OnlineMiner is a Java-Enterprise-based
Intranet component for searching, collecting, analysing and
clustering huge document corpora that integrates the above
mentioned components into a comprehensive application. Using
these three products, we analysed many medical and health-related
newsgroups from USENET. Text corpora of this sort are
very heterogeneous, and the main question is: Can Text Mining
help us to find interesting patterns of relations between
drugs and diseases? Another important question is: Which drugs
are mentioned most often in which subgroup of USENET users?
To answer these questions, we encoded domain-specific knowledge in
specific SkillCartridges based on official drug lists and
disease thesauri. The extraction process produces trees with
concepts for all the coded knowledge. Together with the plain
text resources, we store the results in a database management
system (DBMS). Then, we can access the clustered results with
the OnlineMiner on the server-side and a normal Web browser
on the client. The OnlineMiner offers different views of the
collected data. The starting point for all exploration
is an interface similar to that of a normal Internet search
engine. To get a more intuitive impression of the relevance of the
various concepts, we can show a cluster graph or a pie/bar chart.
One result our demo system delivers is a clustering
of the newsgroup misc.health.diabetes that produces a chart
with the big four diabetes medicines. This set-up allows
us to aggregate information from larger text collections and
produce reports for detailed questions in a novel way.
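The question of which drugs are mentioned together with which diseases can be illustrated with a much simpler stand-in: plain dictionary lookup and co-occurrence counting per post. This sketch is not the Temis extraction technology (which is based on cascaded finite-state transducers and SkillCartridges); the term lists, the sample posts and the function name `cooccurrences` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical term lists; the system described above relies on
# official drug lists and disease thesauri instead.
DRUGS = {"metformin", "insulin"}
DISEASES = {"diabetes", "hypoglycemia"}

def cooccurrences(posts):
    """Count how many posts mention each (drug, disease) pair together."""
    pairs = Counter()
    for post in posts:
        words = set(re.findall(r"[a-z]+", post.lower()))
        for drug in sorted(words & DRUGS):
            for disease in sorted(words & DISEASES):
                pairs[(drug, disease)] += 1
    return pairs

# Invented posts, for illustration only.
posts = [
    "My doctor switched me from insulin to metformin for my diabetes.",
    "Insulin dose too high, had hypoglycemia last night.",
    "Any tips on diet?",
]
result = cooccurrences(posts)
```

Counts of this kind could feed a bar chart of the most frequently mentioned drugs per newsgroup, in the spirit of the chart described above.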
J.
Protein functional classification by text data-mining
Ben Stapley
Biomolecular
Sciences, UMIST, Manchester, United Kingdom; b.stapley@icrf.icnet.uk
Classifying
proteins into classes based on their cellular role or function
is a powerful way of making sense of genomic data. Here we
present a method for the automatic classification of proteins
by analysis of relevant Medline abstracts. We employ keyword
matching using a thesaurus of S. cerevisiae gene naming terms
to retrieve relevant text for each protein. From these texts, term
vectors are generated for each protein. We then train support
vector machines to automatically partition the term space
and to thus discriminate the textual features that define
a protein's function. We test the method on the task of assigning
proteins to various subcellular locations. The method is benchmarked
on a set of proteins of known sub-cellular location. No prior
knowledge of the problem domain nor any natural language processing
is used at any stage. Our method has comparable performance
to rule-based text classifiers and we find that amino acid
compositional information is a poor predictor when employed
in isolation. Combining text with protein amino-acid composition
improves recall. We discuss the generality of the method and
its potential application to a variety of biological classification
problems.
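The two steps named above, building term vectors and training a support vector machine on them, can be sketched with the standard library alone. This is a toy illustration under stated assumptions, not the author's implementation: the four training texts are invented, the vocabulary is simply every word that occurs, and the SVM is a plain linear one trained by subgradient descent on the hinge loss with arbitrarily chosen hyperparameters.

```python
import re

def term_vector(text, vocabulary):
    """Bag-of-words vector: count of each vocabulary term in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [float(tokens.count(term)) for term in vocabulary]

def train_linear_svm(vectors, labels, epochs=200, lr=0.1, reg=0.01):
    """Linear SVM via subgradient descent on the regularised hinge loss.
    Labels must be +1 or -1. Returns (weights, bias)."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularisation shrinks the weights on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # Hinge loss is active: push the boundary away from x.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Invented texts standing in for retrieved Medline abstracts.
texts = [
    "localized to the nucleus and binds chromatin",
    "nuclear protein involved in transcription",
    "found in the mitochondrial matrix",
    "mitochondrial membrane respiration complex",
]
labels = [1, 1, -1, -1]  # +1 = nucleus, -1 = mitochondrion (hypothetical classes)
vocabulary = sorted({t for text in texts for t in re.findall(r"[a-z]+", text.lower())})
vectors = [term_vector(text, vocabulary) for text in texts]
w, b = train_linear_svm(vectors, labels)
predictions = [predict(w, b, x) for x in vectors]
```

A real experiment would use many abstracts per protein, a held-out benchmark set and one classifier per subcellular location; the sketch only shows how text becomes a vector and how a linear boundary is fitted to it.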
K.
Biological function and DNA expression arrays
Christian Blaschke,1 Luis Cornide,2
Juan Carlos Oliveros,3 and Alfonso Valencia4
1:
Protein Design Group at the CNB/CSIC, Cantoblanco, Universidad
Autonoma, Madrid, Spain; blaschke@cnb.uam.es
2: ALMA Bioinformatics, Tres Cantos, Spain; lcornide@almabioinfo.com
3: Protein Design Group at the CNB/CSIC, Cantoblanco,
Universidad Autonoma, Madrid, Spain; oliveros@cnb.uam.es
4: Protein Design Group at the CNB/CSIC, Cantoblanco,
Universidad Autonoma, Madrid, Spain; valencia@cnb.uam.es
DNA arrays are one of the types of large-scale experiments that
have been developed in recent years. These experiments
allow new biological insights but also produce an overwhelming
flow of data that has to be digested and analysed properly.
We developed an information extraction system (GEISHA) that
provides an overview of the literature related to the genes
that are implicated in an experiment. It extracts keywords
and the most important parts of the related abstracts and
reorganizes the information so that a deeper insight into what
has already been published can be gained with much less effort.
Here we present an overview of the system and the results
that were obtained in different studies.
L.
A suite of tools to mine scientific abstracts for nuclear
receptors
Dietrich Schuhmann
LION
Bioscience AG, Heidelberg, Germany; Dietrich.Schuhmann@lionbioscience.com
Bioinformatics has become a strong research area that generates
and offers large amounts of data to molecular biologists. Nowadays this
includes more and more tools to access data from the literature
(semi-)automatically. The consequence is the (semi-)automatic
extraction of information from abstracts and full papers concerned
with biological topics.
LION Bioscience has developed proprietary software for the extraction
of simple facts from sentences. The software is based on cascaded
finite-state transducers (FSTs). The technology identifies patterns,
e.g. 'in vitro', 'in a ligand-dependent manner'; parts of
speech, e.g. verbs, nouns, adjectives; and combinations
of both. Such combinations can be used to identify parts of
sentences which follow syntactical rules. For example, a determiner
followed by an adjective and a noun is called a noun phrase,
e.g. 'the activated protein'. Combinations
of noun phrases in conjunction with a verb are used to identify
relations between objects, e.g. 'Pleiotrophin binds to protein-tyrosine
phosphatase zeta', where 'Pleiotrophin' and the combination
of terms after the verb 'bind to' each represent a noun
phrase. Furthermore, the described technology has to reduce
information by extracting parts of the sentence:
e.g. 'Pleiotrophin has been demonstrated to bind to protein-tyrosine
phosphatase zeta (PTPzeta) with high affinity' is transformed
into 'Pleiotrophin binds to PTPzeta', if this is more suitable
for the user (biologist).
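The cascade described above (first group words into noun phrases, then look for noun phrase, verb, noun phrase) can be sketched as two small stages, each consuming the output of the previous one. This is not LION's proprietary software: the tag names (DET, ADJ, NOUN, VERB, PREP), the hand-tagged input and both function names are assumptions made for the example, and a real system would obtain the tags from its tagging tools.

```python
def chunk_noun_phrases(tagged):
    """Stage 1: merge DET? ADJ* NOUN+ sequences into single (text, 'NP') items."""
    out, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NOUN":
            k += 1
        if k > j:
            # At least one noun was found: emit the whole span as a noun phrase.
            out.append((" ".join(word for word, _ in tagged[i:k]), "NP"))
            i = k
        else:
            # No noun phrase starts here: copy the token through unchanged.
            out.append(tagged[i])
            i += 1
    return out

def extract_relations(chunks):
    """Stage 2: find NP VERB [PREP] NP and return (subject, verb phrase, object)."""
    relations = []
    for i in range(len(chunks) - 2):
        if chunks[i][1] != "NP" or chunks[i + 1][1] != "VERB":
            continue
        verb = chunks[i + 1][0]
        j = i + 2
        if chunks[j][1] == "PREP" and j + 1 < len(chunks):
            verb += " " + chunks[j][0]
            j += 1
        if chunks[j][1] == "NP":
            relations.append((chunks[i][0], verb, chunks[j][0]))
    return relations

# Hand-tagged example sentence from the text above (tags are assumed).
sentence = [
    ("Pleiotrophin", "NOUN"), ("binds", "VERB"), ("to", "PREP"),
    ("protein-tyrosine", "NOUN"), ("phosphatase", "NOUN"), ("zeta", "NOUN"),
]
relations = extract_relations(chunk_noun_phrases(sentence))
```

Keeping the stages separate mirrors the cascaded design: each stage sees a shorter, more abstract sequence than the one before it.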
Different
software components have been put in place to realise the
running system: dictionaries of specific terms, software to
identify complex terms, tagging tools, the cascaded FSTs,
and viewing tools. The analysis tools can cope with any type
of sentence. This does not, however, resolve the complexity of the
information contained in such sentences.
An example of a straightforward sentence is: 'These results suggest
that this peptide region is important for p21 interaction
with cyclin E/Cdk2'. It results in 'p21 interacts with cyclin
E/Cdk2'. The following sentence is an example where the anticipated
outcome is unclear: 'To determine if this region mediates
interaction with Ku70 in mammalian cells ...' It is unclear
what 'this region' refers to and whether 'in mammalian cells'
is an essential detail of the interaction with Ku70.
Any
information extraction method has to meet the needs of the
user. These needs must be clearly identified and described
for several reasons: (1) the perception of written
information differs from user to user, (2) the
topics of interest depend on the user group, and (3)
standardisation (normalisation) of the information from the
written text is as yet an unsolved problem. The text mining group
of LION Bioscience has run curation projects in collaboration
with the laboratory at LION. The core topic of the curation
projects was the identification of interactions between nuclear
receptors, cofactors and ligands to control genetic regulation.
This information was retrieved from Medline abstracts.
As
a result, the group has generated a dictionary which contains
about 1050 terms, i.e. names of nuclear receptors, cofactors,
ligands and species. Furthermore, the dictionary contains
about 11500 synonyms. This representation of terms is used
to integrate the results from the information extraction methods
into software solutions of LION Bioscience.
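How a term dictionary with synonyms supports normalisation can be shown with a small lookup table. The entries below are illustrative only (PTPzeta and its long form come from the example sentence above; the other variants are assumed), and the structure and function names are not taken from LION's software.

```python
# Illustrative entries only; the dictionary described above holds about
# 1050 terms and about 11500 synonyms.
SYNONYMS = {
    "PTPzeta": ["protein-tyrosine phosphatase zeta", "PTP zeta"],
    "Pleiotrophin": ["PTN"],
}

def build_lookup(synonyms):
    """Map every lower-cased term and synonym to its canonical term."""
    lookup = {}
    for canonical, variants in synonyms.items():
        lookup[canonical.lower()] = canonical
        for variant in variants:
            lookup[variant.lower()] = canonical
    return lookup

def normalise(name, lookup):
    """Return the canonical term for a name, or None if it is unknown."""
    return lookup.get(name.strip().lower())

lookup = build_lookup(SYNONYMS)
canonical = normalise("Protein-tyrosine phosphatase zeta", lookup)
```

With such a table, 'Pleiotrophin binds to protein-tyrosine phosphatase zeta' and 'Pleiotrophin binds to PTPzeta' reduce to the same stored fact.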
The
presented work is funded by the German government (Project
Nr. 0312385, Bio-Path). Project members are: Sophia Ananiadou,
Salford University, Manchester, UK, and Prof. Günthner,
Centrum für Informations- und Sprachverarbeitung, LMU
Munich, Germany. Furthermore, the presented work is part of
the Eureka project entitled Bio-Path. The project partner
is ValiGen, Paris, France.
M.
Raw data, funding, and knowledge
Gert Vriend
CMBI,
University of Nijmegen, Nijmegen, Netherlands; vriend@cmbi.kun.nl
These days bioinformatics is going through a phase in which its
major work is managing the data explosion. Granting agencies
and industries have funded a technological revolution (very
fast sequencing, micro arrays, robotics), but have realized
too late that all these techniques generate gigabytes of raw
data. At this moment it is clear that lots of funding should
have been spent five years ago on bioinformatics, but that
did not happen. Today, however, there is plenty of money for
bioinformatics and I will try to summarize what is done, can
be done and should be done with this money. The enormous amount
of data opens up new possibilities for research. One example
is the analysis of multiple sequence alignments with thousands
of sequences, which allows us to determine the function of
almost every residue for the task(s) of the molecule in the
organism.
List
of participants ...
Pieter Adriaans | University of Amsterdam, the Netherlands
Sophia Ananiadou | University of Salford, United Kingdom
Christian Blaschke | Centro Nacional de Biotecnologia, Spain
Stephan Brock | Metalife AG, Germany
Luis Cornide | Alma Bioinformatics S.L., Spain
Michele Finelli | University of Bologna, Italy
Ulrich Grob | Metalife AG, Germany
Roderic Guigo | GRIM, IMIM-UPF, Spain
Udo Hahn | Freiburg University, Germany
Jerry Hobbs | SRI International, USA
Robert Hoffmann | Centro Nacional de Biotecnologia, Spain
Hendri Hondorp | University of Twente, the Netherlands
Bernard Jacq | Laboratoire de Genetique et Physiologie du Developpement, France
Annette Martin | The Babraham Institute, United Kingdom
Jan De Meutter | Ghent University, VIB, Belgium
Barend Mons | Erasmus University Rotterdam, the Netherlands
Ivan Mraz | Inst. of Plant Molecular Biology, Czech Republic
Anton Nijholt | University of Twente, the Netherlands
Gertjan van Ommen | Leiden University Medical Center, the Netherlands
Jan Paces | Institute of Molecular Genetics, Czech Republic
Alberto Pascual Montano | National Center of Biotechnology, Spain
Karel Petrzik | Inst. of Plant Molecular Biology, Czech Republic
Ivan Rossi | University of Bologna, Italy
Jasmin Šarić | European Media Laboratory GmbH, Germany
Dietrich Schuhmann | LION bioscience AG, Germany
Miroslav Sip | Inst. of Plant Molecular Biology, Czech Republic
Ben Stapley | Biomolecular Modelling Lab, United Kingdom
Osman Ugur Sezerman | Sabanci University, Turkey
Alfonso Valencia | Centro Nacional de Biotecnologia, Spain
Paul van der Vet | University of Twente, the Netherlands
Gert Vriend | University of Nijmegen, the Netherlands
Hester Wain | University College London, United Kingdom
Ramon Wartala | TEMIS-group, France
Radek Zíka | Institute of Molecular Genetics, Czech Republic