Microarray data: untapped powers and hidden weaknesses
September 8 and 9, Leuven, Belgium

Organisers

additional sponsorship by...

Report
1. Platform comparison
2. Normalisation
3. Standards + Databases
4. and 5. Integrative data analysis

Highlights

Organisers:
Martin Kuiper, Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB) / Gent University
Paul Van Hummelen, VIB Microarray Facility
Yves Moreau, ESAT / SISTA, KU Leuven

Location of Workshop:
Faculty Club
Room: Lemaire
Groot Begijnhof
Middenstraat 14
3000 Leuven - België
tel: +32.16.329500 - fax: +32.16.329502
e-mail: info@fclub.kuleuven.ac.be
http://www.facultyclub.be/

Report

The Workshop layout was inspired by the EU CAGE project, an FP5 demonstration project that aims to use a new microarray platform to profile an extensive range of biological samples, to pre-process the data, to pre-compute first-order mining data, and to make all data available through ArrayExpress at the EBI. The compendium data will represent a resource for the functional annotation of Arabidopsis genes.

The Workshop aimed to span the whole area of microarray data production and analysis, including different microarray platforms, data pre-processing strategies (normalization), databasing, clustering and modeling approaches. The sessions focused on the strengths and weaknesses of the data and on being proactive in quality assessment and experimental design.

The Workshop was organized in 5 sessions; a summary of the highlights is given below.

1. Platform comparison

In this session 5 speakers discussed different implementations of the microarray platform, including novel technology developments. Paul Van Hummelen gave an overview of the services that can be obtained through the Microarray Facility (MAF) of the VIB. MAF provides a service with both commercial platforms and custom (spotted) arrays.

Jim George, application specialist from Amersham Biosciences, presented their new CodeLink system, which consists of a solid support coated with a polyacrylamide matrix. This matrix provides an enhanced hybridisation environment, resulting in a significantly higher signal-to-noise ratio.

Seamus O'Regan, from Affymetrix, presented Affymetrix's new chip with an 11-micron feature size, which allows the monitoring of 61,000 transcripts with a total of 1,300,000 data points per array. The volume of data generated with such arrays is enormous, and Affymetrix has put considerable effort into the development of an integrative data analysis system. Part of this system is available through the web as an open resource (NetAffx, experiment design, analysis of data, Gene Ontology Mining Tool).

Andreas Ruehlmann, from Agilent Technologies, presented developments of the Agilent ink-jet technology. On-chip synthesis with standard phosphoramidites allows the production of longer oligos than the photolithography approach, which gives greater sensitivity and specificity. They reported mRNA detection limits of 0.2 copies per cell.

Catherine Nguyen, from the TAGC group, Marseille, France, reported on the use of nylon arrays (with up to 10,000 spots). With these arrays the group could identify classification markers for tumor types.

In a discussion on whether Core facilities are still needed when commercial platforms are so easily accessible, it was mentioned that a major advantage to users is that Core facilities accumulate know-how on experiment design and can therefore advise scientists on the design of their experiments.

2. Normalisation

Frank Holstege: The goal of normalization is to counterbalance the non-biological variation present in microarray data. This can be done on two levels: variation between arrays, and variation within an array, e.g. variation caused by different print tips. The talk focused especially on within-array variation. To capture this variation, one sometimes uses genes that show no change during the experiment, the so-called housekeeping genes. In practice this is not such a good idea, as genes that show no change across experiments hardly exist. Most current methods therefore use all genes and rely on the assumption that there is no overall global change, or that the changes are at least balanced, i.e. that the number of up-regulated genes equals the number of down-regulated ones. But what happens if large global changes do take place? Then the only reliable way to normalize is to use spiked-in controls.
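As an illustration of within-array normalization, the sketch below centres the log2 ratios of each print-tip group on that group's median. It is a minimal stand-in under the balanced-change assumption described above, not the method presented in the talk; the function name and the input layout (print tip, red intensity, green intensity) are assumptions made for this example.

```python
from collections import defaultdict
from math import log2
from statistics import median


def print_tip_median_normalize(spots):
    """Centre log2(red/green) ratios on the median of their print-tip group.

    spots: list of (print_tip_id, red_intensity, green_intensity) tuples.
    Assumes most genes are unchanged, or that changes are balanced.
    """
    log_ratios = [(tip, log2(red / green)) for tip, red, green in spots]

    # Collect the log ratios printed by each tip
    by_tip = defaultdict(list)
    for tip, m in log_ratios:
        by_tip[tip].append(m)

    # Subtract the per-tip median so each group is centred on zero
    tip_median = {tip: median(values) for tip, values in by_tip.items()}
    return [m - tip_median[tip] for tip, m in log_ratios]


# Hypothetical intensities: tip 1 has a twofold dye bias, tip 2 has none
spots = [
    (1, 200.0, 100.0), (1, 400.0, 200.0), (1, 800.0, 200.0),
    (2, 100.0, 100.0), (2, 50.0, 100.0), (2, 200.0, 100.0),
]
normalized = print_tip_median_normalize(spots)
print(normalized)
```

If large global changes take place, the median of all genes is no longer a valid reference; in that case the same subtraction would have to use only the spiked-in control spots.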

Robert Nadon: In microarray data analysis we encounter a dimensionality problem: there are thousands of genes and only a limited number of observations per gene, typically between 1 and 3. This is not ideal for statistical analysis; one should have at least 3 observations per data point. Since an experiment can suffer from many sources of bias, if you can counterbalance things in your experiments then do it! Bias correction may itself add bias and may also add random error through error propagation. It is better not to use ratios, as these are not symmetrical and cannot be averaged. Nadon proposes the use of MAS5 and especially RMA (Robust Multi-array Average). RMA implements quantile normalization, and its standard errors are in general a lot smaller than those obtained with MAS5. RMA is therefore preferable to MAS5, except for low-abundance genes.
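The quantile normalization step mentioned above can be sketched as follows. This covers only that one step: RMA also includes background correction and probe-set summarization, which are not shown. The function name and the toy values are assumptions for illustration.

```python
def quantile_normalize(arrays):
    """Give every chip the same intensity distribution.

    arrays: list of equal-length lists, one list of intensities per chip.
    Ties within a chip are broken by position, which is a simplification.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]

    # Reference distribution: mean of the k-th smallest value across chips
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]

    result = []
    for a in arrays:
        # Indices of this chip's values from smallest to largest
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = reference[rank]
        result.append(normalized)
    return result


# Two hypothetical chips with three probes each
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
print(normalized)
```

After this step each chip contains exactly the same set of values, only in a different order, so differences between chips in overall intensity are removed.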

Jörg Hoheisel: Although there are working array technology systems, some technical problems remain. One important aspect is the sensitivity and selectivity of the binding of the assayed DNA molecules; another is that the DNA samples studied usually require (PCR) amplification and (fluorescence) labelling prior to analysis. The structural difference between PNA (peptide nucleic acid), used as the probe on the array, and a DNA target permits direct detection of the nucleic acid by a simple process that is much more sensitive than current techniques and avoids these preparative steps. Upon hybridisation of a DNA or RNA sample to a PNA array, the phosphates of the DNA/RNA can be utilised as an intrinsic label for detection by secondary ion mass spectrometry (SIMS), since PNA molecules lack phosphate groups entirely.

3. Standards + Databases

Alvis Brazma: A tradition in the life sciences is that data supporting publications should be publicly available. The Microarray Gene Expression Data (MGED) Society was established in 1999 at Cambridge and is an international organisation of biologists, computer scientists, and data analysts who aim to facilitate the sharing of microarray data generated by functional genomics and proteomics experiments. The current focus is on establishing standards for microarray data annotation and exchange, facilitating the creation of microarray databases and related software implementing these standards, and promoting the sharing of high-quality, well-annotated data within the life sciences community.
Minimum Information About a Microarray Experiment (MIAME) outlines the information that must be provided about a microarray gene expression experiment to allow unambiguous interpretation of its results. Disclosure of MIAME-compliant data has now been adopted by many scientific journals as a requirement for microarray-based publications.
ArrayExpress is a public repository for microarray data, which is aimed at storing well-annotated data in accordance with MGED recommendations. Its public database now holds about 50 submissions, which amount to 1200 hybridisations. A third of these were submitted via MIAMExpress; partners that produce large amounts of data submit via special pipelines.

Future of ArrayExpress:

The repository only allows simple queries. The warehouse will be able to handle more complex queries and will have a gene centric design. Finally the gene expression atlas will be a renormalized version of the warehouse and will contain summarized information such as which gene is expressed where.

Carl Troein: The BioArray Software Environment (BASE) is free, open-source software that can be used to store, manage and analyze microarray data. It consists of the following integrated parts:
- Samples and treatments
- Array LIMS
- Hybridisations and images
- Pre-analysis such as filtering and normalization
- Analysis through plug-ins
- User administration

The schema of BASE is based on ArrayExpress. MAGE-ML export is currently in the test phase. So far there have been more than 1200 downloads; several laboratories are using the system and also contribute code. For the future, a new BASE system will be developed from scratch (BASE version 2), using the knowledge and experience gained from developing BASE version 1.

Discussion:
GEO's structure is more laid back; will people tend to use GEO more because submitting their data is less work?
Brazma: GEO doesn't have a big software development group. ArrayExpress is more systematic and has good datasets and many collaborators; we work with people who produce high-quality datasets.
Is ArrayExpress interested in working together with BASE so that small labs can have their own 'BASE Express'?
Brazma: BASE can already handle most things. Do we wait for BASE version 2? Do we apply for an EU grant to develop such a system?
SAGE data can't be stored in ArrayExpress; can MIAMExpress be used to store this type of data?
Brazma: We potentially accept such data; the problem is more one of resources and of having an expert at EBI.
Why do we need a separate repository if you will develop a data warehouse?
Brazma: The data warehouse won't be MIAME-compliant, and only the high-quality data will be stored there.

4. and 5. Integrative data analysis 1 & 2

As data analysis remains a major challenge for microarray studies, two sessions were devoted to this topic. The first presentation, by Martin Kuiper of the Flanders Interuniversity Institute for Biotechnology (Ghent, BE), was devoted to the Compendium of Arabidopsis Gene Expression (CAGE) project, a large ongoing EU FP5 project that aims to run 4,000 full-genome Arabidopsis microarrays to produce a prototype for a compendium covering a large part of the Arabidopsis transcriptome (different developmental stages, organs, environmental conditions, ecotypes, or mutants). Such a project emphasizes again how big the challenges for microarray data analysis are.

The second presentation, by Ritsert Jansen from the University of Groningen (NL), put forward an intriguing approach to gene network reconstruction from microarray data. It proposed to combine the power of classical genetic analysis (such as recombinant inbred lines) with the power of microarray experiments. In such a setup, variations in gene expression are explained in terms of genotypic variations. By determining the genotypic pattern in each experiment (for example, through allelic markers), the contribution of the different loci to gene expression can be determined, which is a large step towards the reconstruction of gene networks.
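The idea of attributing expression differences to loci can be sketched as below: for one gene, the lines are split by their allele at each marker and the difference in mean expression is taken as that marker's effect. This is an illustrative simplification, not the method from the presentation; the function name, the marker names and the two-allele 'A'/'B' coding are assumptions, and a real analysis would add a proper statistical test and a correction for multiple testing.

```python
from statistics import mean


def locus_effects(expression, genotypes):
    """Estimate the effect of each marker on one gene's expression.

    expression: list of expression values, one per inbred line.
    genotypes: dict mapping marker name -> list of 'A'/'B' alleles,
               one per line, in the same order as `expression`.
    Returns marker -> mean(expression | A) - mean(expression | B).
    """
    effects = {}
    for marker, alleles in genotypes.items():
        group_a = [e for e, g in zip(expression, alleles) if g == "A"]
        group_b = [e for e, g in zip(expression, alleles) if g == "B"]
        effects[marker] = mean(group_a) - mean(group_b)
    return effects


# Hypothetical data: four lines, two markers
expression = [8.0, 9.0, 5.0, 6.0]
genotypes = {
    "m1": ["A", "A", "B", "B"],
    "m2": ["A", "B", "A", "B"],
}
effects = locus_effects(expression, genotypes)
best_marker = max(effects, key=lambda m: abs(effects[m]))
print(effects, best_marker)
```

Repeating this for every gene gives, for each transcript, the loci that appear to influence it, which is the raw material for proposing regulatory links.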

Next, the presentation by Jaak Vilo from Egeen (Tartu, Estonia) focused on Expression Profiler, a set of tools for gene expression analysis linked to ArrayExpress (developed while he was at the European Bioinformatics Institute). In particular, tools for cis-regulatory sequence analysis (such as SPEXS) were discussed.

The presentation by Yves Moreau (Katholieke Universiteit Leuven, BE) focused on a generic statistical method called Gibbs sampling, which is a Markov Chain Monte Carlo technique. It was argued that this general technique can be used flexibly in a variety of data analysis problems in bioinformatics. This was illustrated by its application both to the analysis of cis-regulatory sequences (Gibbs sampling for motif finding) and to the (bi)clustering of microarray data. Biclustering is particularly relevant when analysing heterogeneous sets of microarray data, where not every experiment may be relevant to the detection of a relationship between two genes at the transcriptional level.
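To make Gibbs sampling for motif finding more concrete, here is a minimal textbook-style sketch, not the implementation discussed in the talk. It assumes exactly one motif occurrence per sequence, a fixed motif width, sequences over the alphabet ACGT only, and a uniform background model; the function name, parameters and example sequences are hypothetical.

```python
import random


def gibbs_motif_search(seqs, width, iterations=200, seed=0):
    """Sample one motif start position per sequence by Gibbs sampling."""
    rng = random.Random(seed)
    alphabet = "ACGT"

    # Start from a random motif position in every sequence
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build position-specific counts from all OTHER sequences,
            # with a pseudocount of 1 so that no probability is zero
            counts = [{c: 1.0 for c in alphabet} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j != i:
                    for k in range(width):
                        counts[k][other[positions[j] + k]] += 1.0
            total = len(seqs) - 1 + len(alphabet)

            # Score every candidate start in the held-out sequence
            weights = []
            for start in range(len(seq) - width + 1):
                p = 1.0
                for k in range(width):
                    p *= counts[k][seq[start + k]] / total
                weights.append(p)

            # Sample (rather than maximise) the new start position
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


# Hypothetical sequences, each containing the planted motif TTGACA
seqs = ["ACGTTGACAGCTA", "TTGACAGGCATCG", "GGCATTGACATAC", "CATACGTTGACAT"]
width = 6
positions = gibbs_motif_search(seqs, width)
for s, p in zip(seqs, positions):
    print(p, s[p:p + width])
```

Because positions are sampled rather than chosen greedily, the procedure can escape poor starting points, but a single run is not guaranteed to recover the planted motif; in practice several runs with different seeds would be compared.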

In the second session on the theme of integrative data analysis, the presentation by Thomas Lengauer (Max Planck Institute, Saarbrücken, DE) focused on mapping gene expression data onto pathways for the detection of expression patterns at the level of the pathway. Such an approach increases the interpretability of microarray data by mapping it to a level (i.e., pathways) that is more meaningful to biologists.
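A very simple way to picture mapping expression data onto pathways is to average the log ratios of the genes assigned to each pathway, as in the sketch below. This is only an illustration of the general idea and not the scoring scheme presented in the talk; the function name, gene names and pathway names are hypothetical.

```python
from statistics import mean


def pathway_scores(expression, pathways):
    """Summarise gene-level log ratios at the pathway level.

    expression: dict mapping gene name -> log ratio.
    pathways: dict mapping pathway name -> list of member genes.
    Pathways with no measured member genes are left out.
    """
    scores = {}
    for name, genes in pathways.items():
        measured = [expression[g] for g in genes if g in expression]
        if measured:
            scores[name] = mean(measured)
    return scores


# Hypothetical log ratios and pathway memberships
expression = {"geneA": 2.0, "geneB": 1.0, "geneC": -1.0, "geneD": -3.0}
pathways = {
    "pathway1": ["geneA", "geneB"],
    "pathway2": ["geneC", "geneD", "geneE"],  # geneE was not measured
    "pathway3": ["geneF"],                    # no measured genes
}
scores = pathway_scores(expression, pathways)
print(scores)
```

A pathway whose genes move together stands out with a large positive or negative score, whereas opposite changes within one pathway cancel out, which is one reason more refined scores are used in practice.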

The next presentation by Iftach Nachman from Hebrew University (Jerusalem, IL) concentrated on the learning of regulatory networks using probabilistic graphical models (such as Bayesian networks and probabilistic relational models). A semi-physical model of transcriptional regulation was introduced and translated into a probabilistic model that simultaneously involved the cis-regulatory sequence and the expression data. Methods for learning such models from data were presented.

Frank Holstege from UMC Utrecht (NL) then presented methods for combining multiple types of genomic data, such as expression data and protein-protein interaction data, to reconstruct gene networks. He also discussed to what extent the different types of data were correlated.

Finally, Irit Gat-Viks from Tel-Aviv University (IL) presented a set of problems ranging from clustering to gene network reconstruction directly involved in the integrative analysis of microarray data and discussed methods for their solution based on discrete mathematics (such as graph theory methods).

A number of themes emerged from the presentations and the discussions. First, and on a sobering note, it was acknowledged that, regardless of the amount of high-throughput data available, ab initio reconstruction of entire gene networks is likely to be impossible. There is therefore a need for methods that focus on smaller modules within the network and that incorporate prior knowledge of the problem (this was for example illustrated in the presentations by Thomas Lengauer and Iftach Nachman). Second, it appears that the analysis of microarray data is about to change radically. While biologists have so far mostly focused on the analysis of their own data (i.e., data produced in the course of their own project), the exponential growth in the amount of publicly available data means that gene expression studies will soon systematically start with an analysis of the data available from compendia (such as the CAGE project) and from repositories (such as ArrayExpress). This primary analysis will generate relevant biological hypotheses that will then be validated through follow-up experiments. New tools are needed within this changing context (as illustrated in the presentations by Martin Kuiper, Jaak Vilo, and Yves Moreau).

Highlights and Abstracts from the meeting, provided by Martin Kuiper, are available here.