Data management: Databases, interfaces and ontologies
This area addresses the development of new guidelines and directions for
data management, in a form that allows their rational use.
Bioinformatics has become an integral
part of almost all projects within genome research. Data acquisition,
data analysis and databases to store and search for data are
the main areas where informatics supports the researcher.
Informatics is essential in various steps throughout an experiment,
especially for high throughput approaches producing large
amounts of complex data, such as hybridisations of complex
probes to high density membranes or microarray chips, expression
profiling experiments or protein 2-D analysis. Specific examples
include image analysis to identify the positions and densities
of spots after hybridisation and to convert captured images
to digital values, statistical analysis to compare data
points and normalise data sets, data display, and many
more.
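
To make the normalisation step mentioned above concrete, the following is a minimal sketch of one common approach: log-transforming spot intensities and centring each data set on its median. The function name and the intensity values are hypothetical and chosen only for illustration; real pipelines typically apply more elaborate corrections.

```python
import math
from statistics import median

def normalise(intensities):
    """Log2-transform spot intensities and centre them on their median."""
    logs = [math.log2(v) for v in intensities]
    centre = median(logs)
    return [x - centre for x in logs]

# Two hypothetical hybridisations of the same spots; the second scan is
# uniformly twice as bright as the first (illustrative values only).
array_a = [100.0, 200.0, 400.0]
array_b = [200.0, 400.0, 800.0]

norm_a = normalise(array_a)
norm_b = normalise(array_b)
```

After normalisation the two data sets agree spot by spot, because a uniform brightness difference becomes a constant offset on the log scale and is removed by the median centring.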

Currently there are a number of deficiencies. In contrast to structural
genomics, where many software tools and databases exist,
only a few sophisticated tools have been developed so far
for the complex data arising from expression profiling studies:
two European databases are in advanced development, one at
the European Bioinformatics Institute (Cambridge) and another
at the Max Planck Institute for Molecular Genetics (Berlin).
Furthermore, no major database exists to deal with the large
amounts of experimental data from different functional genomic
projects (ranging from DNA-based to protein-based techniques,
from knock-in/out mice to in situ hybridisations),
although a small number of specialised databases for certain
expression data have been developed. Despite several attempts
around the world to build databases specifically for the management
of biomolecular interactions, no standard has yet emerged.
In general, more programs are available for the early stages
of experiments, such as the detection and quantification of
spots on microarrays or 2-D gels, than for the efficient and
user-friendly management, analysis and display of large complex
data sets.

The incorporation of known functional information into databases
at various levels is thus a pressing need requiring the combined
efforts of experimentalists, computational biologists and
database developers. The challenge is particularly formidable
as new types of information, such as tissue and organ gene
expression patterns on a genome scale and large amounts of
data on protein interactions, post-translational modification
and protein structure, are rapidly becoming available. The
continuous flow of information also requires 'update' and
'awareness' tools that filter and incorporate incoming data
(sequences, literature, etc.) and new applications (servers,
methods). These tools will systematically integrate the filtered
information into dynamic databases.

Central questions in the management of information in functional
genomics are the accessibility of the data, the classifications and
ontologies used, and the identification of errors. Variations
in database formats and technologies add significantly to
the already complex task of making productive use of
data distributed over many different sites. Even though
many databases directly or indirectly reference data stored
elsewhere, these links are difficult to exploit due to the
large differences between individual database implementations.
Standardisation of formats, or at least an agreed interface
for database interconnectivity, would greatly alleviate these
problems. Moreover, the accuracy of existing databases needs to
improve: not all of the data is accurate, and
most of it is likely to be less reliable than the gene sequences
themselves. Biologists must be able to rely on up-to-date,
accurate information on topics such as gene expression patterns.
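
As an illustration of what an agreed interface for database interconnectivity might look like in practice, the sketch below maps records from two differently formatted sources onto one shared record type. The sites, field names and record layout are all hypothetical assumptions made for this example; no existing database format is implied.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpressionRecord:
    """Shared record type that every participating site agrees to export."""
    gene_id: str
    tissue: str
    level: float

def from_site_a(row: dict) -> ExpressionRecord:
    # Hypothetical site A: keyed fields, expression level stored as text.
    return ExpressionRecord(row["gene"], row["tissue"].lower(), float(row["expr"]))

def from_site_b(line: str) -> ExpressionRecord:
    # Hypothetical site B: tab-separated "tissue<TAB>gene<TAB>level" lines.
    tissue, gene, level = line.rstrip("\n").split("\t")
    return ExpressionRecord(gene, tissue.lower(), float(level))

records = [
    from_site_a({"gene": "geneX", "tissue": "Liver", "expr": "2.5"}),
    from_site_b("liver\tgeneX\t2.5\n"),
]
```

Once both sources are expressed in the shared form, the two entries compare as equal, so cross-references between sites can be followed and checked without knowledge of either site's internal format.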

Without getting involved in maintaining a database as such, the programme
will aim to support efforts to achieve a common subset of
ontologies, classes and structures which are best able to
store and represent the experiments and resulting data in
the areas involved. It will also focus on identification or
definition of general ways of presenting and visualising the
complex data, probably in a graphical format. This is particularly
important for 'wet lab' workers who want to analyse their data
and are often dissatisfied with the available databases, which
they may not find easy to use in practice. We would
seek to establish guidelines for the development of such interfaces,
taking into account the specific needs of experimentalists.

At the level of European scientific infrastructure, the computer
requirements for functional genomics should be identified
and a proposal made for how the facilities may be provided
most effectively. Another issue is the economics of bioinformatics
resources: the cost and effectiveness of linking to a national,
regional or local server should be compared with those of
establishing local copies of databanks (costs of equipment,
technical support staff). The value of shared access to software
and licensing should also be investigated. It will be important
to understand the problems in data archives and how to disseminate
information to scientists at the bench most effectively. Similarly,
we need to identify the problems of multiple copies of databanks
installed all over Europe, existing as they do in different
versions and at different update levels. Possible solutions can
be discussed, e.g. the provision of software to analyse what is
present locally, together with a report recommending updates
where appropriate.
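
The idea of software that analyses what is present locally and recommends updating could be sketched roughly as follows. This is only an illustration under stated assumptions: the databank names and release numbers are invented, and releases are assumed to be comparable integers.

```python
def update_report(local, reference):
    """Return one line per databank whose local copy is missing or out of date."""
    report = []
    for name, latest in sorted(reference.items()):
        installed = local.get(name)
        if installed is None:
            report.append(f"{name}: not installed (current release {latest})")
        elif installed < latest:
            report.append(f"{name}: release {installed} installed, update to {latest}")
    return report

# Hypothetical local installation and hypothetical current releases.
local_site = {"databank_a": 38, "databank_b": 61}
reference_releases = {"databank_a": 39, "databank_b": 61, "databank_c": 5}

report = update_report(local_site, reference_releases)
```

Databanks that are already current produce no entry, so the report lists only the copies that need attention.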

Contacts within the programme
John Armstrong
Francisco Azuaje
Amos Bairoch
Cyrus Chothia
Werner Dubitzky
Thure Etzold
Jay Hinton
Benoit Leblanc
Marie-Paule Lefranc
Rune Linding
Antoni Matilla
Syed Asad Rahman
Isabel Rojas
Susanna-Assunta Sansone
Peter Savic
Gavin H. Thomas
Paul van der Vet
Ulrike Wittig
Marc Zabeau
Günther Zehetner