Microarray data: untapped powers and hidden weaknesses
September 8 and 9, Leuven, Belgium
Organisers:
Martin Kuiper, Plant Systems Biology, Flanders Interuniversity
Institute for Biotechnology (VIB) / Gent University
Paul Van Hummelen, VIB Microarray Facility
Yves Moreau, ESAT / SISTA, KU Leuven
Location of Workshop:
Faculty Club
Room: Lemaire
Groot Begijnhof
Middenstraat 14
3000 Leuven - België
tel: +32.16.329500 - fax: +32.16.329502
e-mail: info@fclub.kuleuven.ac.be
http://www.facultyclub.be/
Report
The workshop layout was inspired by the EU CAGE project, an FP5
demonstration project that aims to use a new microarray platform
to profile an extensive range of biological samples, to
pre-process the data, to pre-compute first-order mining data,
and to make all data available through ArrayExpress at the EBI.
The compendium data will represent a resource for functional
annotation of Arabidopsis genes.
The workshop aimed to span the whole area of microarray data
production and analysis, including different microarray platforms,
data pre-processing strategies (normalization), databasing,
clustering and modeling approaches. The focus of the sessions
was to consider strengths and weaknesses in the data, and
to try to be proactive in quality assessment and experimental
design.
The workshop was organized in 5 sessions; a summary of the
highlights is given below.
1. Platform comparison
In this session 5 speakers discussed different implementations
of the microarray platform, including novel technology developments.
Paul Van Hummelen gave an overview of the services that can be
obtained through the Microarray Facility (MAF) of the VIB.
MAF provides a service both with commercial platforms and
with custom (spotted) arrays.
Jim George, application specialist from Amersham Biosciences,
presented their new CodeLink system, consisting of a solid
support coated with a polyacrylamide matrix. This matrix provides
an enhanced hybridisation environment, resulting in a significantly
higher signal-to-noise ratio.
Seamus O'Regan, from Affymetrix, presented Affy's new 11 micron
(feature size) chip, which allows the monitoring of 61,000
transcripts with a total of 1,300,000 data points per array. The
volume of data generated with such arrays is enormous, and
Affymetrix has put considerable effort into the development of an
integrative data analysis system. Part of this system is available
through the web as an open resource (NetAffx, experiment design,
analysis of data, Gene Ontology Mining Tool).
Andreas Ruehlmann, from Agilent Technologies, presented developments
of the Agilent ink-jet technology. Synthesis on the chip
with standard phosphoramidites allows the production of longer
oligos than is possible with the photolithography approach. This
gives greater sensitivity and specificity. They reported mRNA
detection limits of 0.2 copies per cell.
Catherine Nguyen, from the TAGC group, Marseille, France, reported
on the use of nylon arrays (with up to 10,000 spots). With their
arrays they could identify classification markers for tumor
types.
In a discussion on the need for Core facilities, given the
easy access to commercial platforms, it was mentioned
that a major advantage to the user is that Core facilities
accumulate know-how on experiment design, and as such can advise
a scientist on the design of his/her experiment.
2. Normalisation
Frank Holstege: The goal of normalization is to counterbalance the
non-biological variation present in microarray data. This can be
done on two levels. On the one hand, you can consider variation
between the arrays; on the other hand, there is also variation
within the array, e.g. variation caused by the different print tips.
This talk focused especially on the within-array variation.
To capture this variation, one sometimes uses genes that show
no change during the experiment, the so-called housekeeping
genes. In practice this is not such a good idea, as genes
that show no change across the experiments almost don't exist.
Most current methods therefore use all genes and rely
on the assumption that there is no overall global change, or
that the changes are at least balanced, i.e. that
the number of up-regulated genes equals the number of down-regulated
ones. But what happens if large global changes do take place?
Then the only reliable way to normalize is to use spiked-in
controls.
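
The difference between normalizing on all genes and normalizing on
spiked-in controls can be sketched as follows. This is a minimal
illustration, not the method presented in the talk: it assumes
two-colour intensities, a simple median shift of the log2 ratios,
and spike-in controls added in equal amounts to both channels. The
function names and the example intensities are hypothetical.

```python
import math
from statistics import median


def log_ratios(red, green):
    """Per-spot log2(red / green) for one two-colour array."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def normalize_global(m):
    """Centre log ratios on the median of ALL spots.

    Only valid if up- and down-regulation are roughly balanced.
    """
    shift = median(m)
    return [x - shift for x in m]


def normalize_spikes(m, spike_idx):
    """Centre log ratios on the median of the spike-in spots only.

    Assumes (hypothetically) that spikes were added in equal amounts
    to both samples, so their true log ratio is 0.
    """
    shift = median(m[i] for i in spike_idx)
    return [x - shift for x in m]


# Spots 0-1 are spikes (dye bias only, ratio 2); spots 2-4 are genes
# that are all truly 2-fold up on top of the same dye bias (ratio 4).
m = log_ratios([200, 200, 400, 400, 400], [100] * 5)
print(normalize_global(m))          # global change is normalized away
print(normalize_spikes(m, [0, 1]))  # global change is preserved
```

In this toy case the global median shift removes the genuine
two-fold increase shared by all genes, whereas the spike-based shift
removes only the dye bias.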
Robert Nadon: In microarray data analysis we encounter a
dimensionality problem: there are thousands of genes and only a
limited number of observations per gene, typically between 1 and 3.
This is not ideal for statistical analysis. One should have at
least 3 observations per datapoint. Since an experiment
can suffer from many sources of bias, if you can counterbalance
things in your experiments then do it! Bias correction may
itself add bias, and may also add random error through error
propagation. It is better not to use raw ratios, as these are not
symmetrical and cannot be meaningfully averaged. Nadon proposes the
use of MAS5 and especially RMA (robust multichip average). RMA uses
quantile normalization, and its standard errors are in general
a lot smaller than when MAS5 is applied. RMA is therefore
preferable to MAS5, except for low-abundance genes.
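
Two of the points above lend themselves to a small sketch: why raw
ratios average badly, and what quantile normalization does. The code
below is an illustration under simplifying assumptions (no tied
intensities, no missing values). It shows only the quantile
normalization step, not RMA as a whole, which also involves
background correction and probe-level summarization. Function names
and numbers are hypothetical.

```python
import math
from statistics import mean


def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of arrays, each a list of intensities for the same
    genes in the same order. Ties are not treated specially.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the r-th smallest value per array.
    reference = [mean(sa[r] for sa in sorted_arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # gene indices by rank
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        result.append(out)
    return result


# Ratios are asymmetric: 2-fold up and 2-fold down do not cancel...
ratios = [2.0, 0.5]
print(mean(ratios))                        # 1.25
# ...but their log2 values do.
print(mean(math.log2(r) for r in ratios))  # 0.0

print(quantile_normalize([[2, 4, 6], [10, 30, 20]]))
```

After normalization both arrays contain the same set of values (6,
12 and 18 here), assigned to genes according to their original rank
within each array.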
Jörg Hoheisel: Although there are working systems of array
technology, some technical problems remain. One important aspect is
the sensitivity and selectivity of the binding of the assayed
DNA molecules; another is the fact that the studied DNA samples
usually require (PCR) amplification and (fluorescence) labelling
prior to analysis. The structural difference between PNA (peptide
nucleic acid), used as probe on the array, and a DNA target permits
direct detection of the nucleic acid by a simple process that is
much more sensitive than current techniques and avoids
these preparative steps. Upon hybridisation of a DNA or RNA
sample to a PNA array, the phosphates of the DNA/RNA can be
utilised as an intrinsic label for detection by secondary
ion mass spectrometry (SIMS), since PNA molecules lack phosphate
groups entirely.
3. Standards + Databases
Alvis Brazma: A tradition in life sciences is that data supporting
publications should be publicly available. The Microarray
Gene Expression Data (MGED) Society was established in 1999
at Cambridge and is an international organisation of biologists,
computer scientists, and data analysts who aim to facilitate
the sharing of microarray data generated by functional genomics
and proteomics experiments. The current focus is on establishing
standards for microarray data annotation and exchange, facilitating
the creation of microarray databases and related software
implementing these standards, and promoting the sharing of
high-quality, well-annotated data within the life sciences
community.
Minimum Information About a Microarray Experiment (MIAME)
outlines the information that must be provided about a microarray
gene expression experiment to allow unambiguous interpretation
of its results. Disclosure of MIAME-compliant data has now
been adopted by many scientific journals as a requirement
for microarray-based publications.
ArrayExpress is a public repository for microarray data, which
is aimed at storing well-annotated data in accordance with
MGED recommendations. Its public database now holds about 50
submissions, which amount to 1200 hybridisations. A third
of these were submitted via MIAMExpress; partners that produce
large amounts of data submit via special pipelines.
Future of ArrayExpress:

The repository only allows simple queries. The warehouse will
be able to handle more complex queries and will have a
gene-centric design. Finally, the gene expression atlas will be
a renormalized version of the warehouse and will contain summarized
information such as which gene is expressed where.
Carl Troein: Presented the BioArray Software Environment (BASE).
BASE is free, open-source software that can be used to store,
manage and analyze microarray data. It consists of the following
integrated parts:
- Samples and treatments
- Array LIMS
- Hybridisations and images
- Pre-analysis such as filtering and normalization
- Analysis through plug-ins
- User administration
The schema of BASE is based on ArrayExpress. MAGE-ML export is
currently in the test phase. So far there have been more
than 1200 downloads; several laboratories are using the system
and also contribute code. In the future a new BASE system
will be developed from scratch (BASE version 2), using the
knowledge and experience gained from developing BASE version
1.
Discussion:
GEO's structure is more laid back; will people tend to use GEO more
because it's less work for them to submit their data?
Brazma: GEO doesn't have a big software development group.
ArrayExpress is more systematic and has good datasets and
many collaborators; we work with people that produce high-quality
datasets.
Is ArrayExpress interested in working together with BASE so
small labs can have their own 'BASE Express'?
Brazma: BASE can already handle most things. Do we wait for
BASE version 2? Do we apply for an EU grant to develop such
a system?
SAGE data can't be stored in ArrayExpress; can MIAMExpress
be used to store this type of data?
Brazma: We potentially accept such data. The problem is more
one of resources and of having an expert at EBI.
Why do we need a separate repository if you will develop a
data warehouse?
Brazma: The data warehouse won't be MIAME-compliant and only
the high-quality data will be stored there.
4, 5: Integrative data analysis 1 & 2
As data analysis remains a major challenge for microarray studies,
two sessions were devoted to this topic. The first presentation,
by Martin Kuiper of the Flanders Interuniversity Institute for
Biotechnology (Ghent, BE), was devoted to the Compendium of
Arabidopsis Gene Expression (CAGE) project, a large ongoing EU FP5
project aimed at the execution of 4,000 full-genome Arabidopsis
microarrays to produce a prototype for a compendium covering a
large part of the Arabidopsis transcriptome (different developmental
stages, organs, environmental conditions, ecotypes, or mutants).
Such a project emphasizes again how big the challenges for
microarray data analysis are.
The second presentation, by Ritsert Jansen from the University
of Groningen (NL), put forward an intriguing standpoint for
microarray analysis towards gene network reconstruction. It
proposed to combine the power of classical genetic analysis
(such as recombinant inbred lines) with the power of microarray
experiments. In such a setup, variations in gene expression
are expressed in terms of genotypic variations. By determining
the genotypic pattern in each experiment (for example, through
allelic markers), the contribution of the different loci to
gene expression can be determined, which is a large step towards
the reconstruction of gene networks.
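
The idea of relating expression variation to genotypic variation can
be sketched with a deliberately naive marker scan. This is an
assumption-laden illustration, not the analysis presented: it
compares mean expression between lines carrying allele "A" or "B" at
each marker, with no significance testing, and all names and values
are hypothetical. It also assumes both alleles occur at every marker.

```python
from statistics import mean


def allele_effect(expression, genotypes):
    """Mean expression of 'B' lines minus mean expression of 'A' lines.

    expression: one expression value per inbred line (for one gene).
    genotypes:  allele ('A' or 'B') of each line at one marker.
    """
    a = [e for e, g in zip(expression, genotypes) if g == "A"]
    b = [e for e, g in zip(expression, genotypes) if g == "B"]
    return mean(b) - mean(a)


def scan_markers(expression, marker_genotypes):
    """Effect of every marker on the expression of one gene."""
    return {name: allele_effect(expression, g)
            for name, g in marker_genotypes.items()}


# Four hypothetical lines, one gene, two markers.
expression = [1.0, 1.2, 3.0, 3.2]
markers = {
    "m1": ["A", "A", "B", "B"],  # tracks the expression difference
    "m2": ["A", "B", "A", "B"],  # largely unrelated to expression
}
effects = scan_markers(expression, markers)
best = max(effects, key=lambda name: abs(effects[name]))
print(effects, best)
```

A marker with a large effect points to a locus that contributes to
the expression of the gene; a real analysis would replace the mean
difference with a proper statistical test across many genes and
markers.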
Next, the presentation by Jaak Vilo from Egeen (Tartu, Estonia)
focused on Expression Profiler, a set of tools for gene expression
analysis linked to ArrayExpress (developed while at the European
Bioinformatics Institute). In particular, tools for cis-regulatory
sequence analysis (such as SPEXS) were discussed.
The presentation by Yves Moreau (Katholieke Universiteit Leuven,
BE) focused on a generic statistical method called Gibbs sampling,
which is a Markov Chain Monte Carlo technique. It was argued
that this general technique can be flexibly used in a variety
of data analysis problems in bioinformatics. This was illustrated
by its application to both the analysis of cis-regulatory
sequences (Gibbs sampling for motif finding) and to the (bi)clustering
of microarray data. Biclustering of microarray data is particularly
relevant in the analysis of heterogeneous sets of
microarray data, where not every experiment may be relevant
to the detection of a relationship between two genes at the
transcriptional level.
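
For readers unfamiliar with Gibbs sampling for motif finding, a
textbook-style site sampler is sketched below. It is not the
speaker's implementation. It assumes exactly one motif occurrence
per sequence, a fixed motif width, a uniform background, and simple
pseudocounts; the function name, parameters and example sequences
are hypothetical.

```python
import random


def gibbs_motif_sampler(seqs, width, n_iter=200, pseudo=1.0, seed=0):
    """Sample one motif start position per sequence (DNA, ACGT only)."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from random positions.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        for i, s in enumerate(seqs):
            # Build a position-specific count matrix from all OTHER sequences.
            counts = [{a: pseudo for a in alphabet} for _ in range(width)]
            for j, t in enumerate(seqs):
                if j != i:
                    for k in range(width):
                        counts[k][t[pos[j] + k]] += 1
            total = len(seqs) - 1 + len(alphabet) * pseudo
            # Score every candidate start in sequence i under that profile.
            weights = []
            for start in range(len(s) - width + 1):
                p = 1.0
                for k in range(width):
                    p *= counts[k][s[start + k]] / total
                weights.append(p)
            # Sample the new position in proportion to its score.
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos


seqs = ["TTACGTAA", "ACGTGGCC", "GGCCACGT"]
print(gibbs_motif_sampler(seqs, width=4))
```

Because the positions are sampled rather than chosen greedily, the
procedure can escape poor starting points; in practice several runs
with different seeds are compared.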
In the second session on the theme of integrative data analysis,
the presentation by Thomas Lengauer (Max Planck Institute,
Saarbrücken, DE) focused on mapping gene expression data onto
pathways for the detection of expression patterns at the level
of the pathway. Such an approach increases the interpretability
of microarray data by mapping it at a level (i.e., pathways)
that is more meaningful to biologists.
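
As a rough illustration of what mapping expression data onto
pathways can mean, the sketch below averages gene-level log ratios
over the members of each pathway. The report does not describe the
scoring actually used in the talk, so the mean is an assumption, and
the gene and pathway names are hypothetical.

```python
from statistics import mean


def pathway_scores(gene_values, pathways):
    """Average gene-level values over each pathway's measured members.

    gene_values: {gene: log ratio} for the genes on the array.
    pathways:    {pathway name: list of member genes}.
    Pathways with no measured member are left out.
    """
    scores = {}
    for name, genes in pathways.items():
        measured = [gene_values[g] for g in genes if g in gene_values]
        if measured:
            scores[name] = mean(measured)
    return scores


gene_values = {"g1": 2.0, "g2": 1.0, "g3": -1.0}
pathways = {
    "pathway_a": ["g1", "g2"],
    "pathway_b": ["g3", "g4"],  # g4 is not on the array
    "pathway_c": ["g5"],        # nothing measured
}
print(pathway_scores(gene_values, pathways))
```

Summarizing at the pathway level trades gene-level detail for a
smaller number of scores that can be read directly in biological
terms.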
The next presentation, by Iftach Nachman from Hebrew University
(Jerusalem, IL), concentrated on the learning of regulatory
networks using probabilistic graphical models (such as Bayesian
networks and probabilistic relational models). A semi-physical
model of transcriptional regulation was introduced and translated
into a probabilistic model that simultaneously involved the
cis-regulatory sequence and the expression data. Methods for
learning such models from data were presented.
The next presentation, by Frank Holstege from UMC Utrecht (NL),
covered methods for the combination of multiple types of
genomic data, such as expression data and protein-protein
interaction data, to reconstruct gene networks. He also discussed
to what extent the different types of data were correlated.
Finally, Irit Gat-Viks from Tel-Aviv University (IL) presented a set
of problems, ranging from clustering to gene network reconstruction,
directly involved in the integrative analysis of microarray
data and discussed methods for their solution based on discrete
mathematics (such as graph theory methods).
A number of themes emerged from the presentations and the
discussions. First, and on a sobering note, it was acknowledged
that, regardless of the amount of high-throughput data available,
ab initio reconstruction of entire gene networks is likely to be
impossible. There is therefore a need for methods that focus on
smaller modules within the network and that incorporate prior
knowledge of the problem (this was for example illustrated in the
presentations by Thomas Lengauer and Iftach Nachman). Second, it
appears that the analysis of microarray data is about to change
radically. While biologists have so far mostly focused on the
analysis of their own data (i.e., produced in the course of their
own project), the exponential growth in the amount of publicly
available data means that gene expression studies will shortly
start systematically with an analysis of the data available
from compendia (such as the CAGE project) and from repositories
(such as ArrayExpress). This primary analysis will generate
relevant biological hypotheses that will then be validated
through follow-up experiments. New tools are needed within
this changing context (as illustrated in the presentations
by Martin Kuiper, Jaak Vilo, and Yves Moreau).
Highlights and abstracts from the meeting, made available
by Martin Kuiper, are available here.