Development of standards-compliant tools for molecular interaction data management
16-19 November 2008
Hinxton, UK
Organiser:
Henning Hermjakob, EMBL-EBI, Hinxton, UK
Draft Report
Summary
Molecular interaction data is a key resource in modern biomedical research, but interactions are often presented in different data formats and managed in different protocols.
The HUPO Proteomics Standards Initiative has developed the PSI MI XML format, a community standard for molecular interactions that is now widely implemented, and has recently been complemented by a simplified tabular format (MITAB 2.5). The aim of the workshop was to provide intensive hands-on training in the development of standards-compliant software for molecular interaction data management.
In particular, we defined the PSI Common Query Interface (PSICQUIC), an interface (API) for direct computational access to molecular interaction data resources. As of April 2010, the PSICQUIC interface is implemented by nine international molecular interaction data resources, providing about 1.6 million binary interactions.
Scientific Content
Molecular interactions are key to our understanding of biological systems, and there are many experimental techniques by which they are determined. These range from hypothesis-driven experiments analysing details of specific binding regions, to large-scale experiments determining thousands of interactions in a systematic, genome-wide manner. A multitude of databases aim to capture molecular interactions, ranging from author-maintained resources reporting only laboratory data, to curated general databases combining literature curation with direct data submissions. Composite databases attempt to represent as many interactions as possible by integrating data from multiple other databases. In addition, many organisations maintain in-house interaction data resources, often containing proprietary data. This plethora of resources, whilst rich in data, presents the research worker interested in a specific biological domain with a significant data access challenge.
Prior to 2004, most of the existing resources released data in their own specific formats, making the difficult task of data integration still harder. However, in that year the HUPO Proteomics Standards Initiative defined the PSI-MI XML format, a community standard for the representation of molecular interaction data accompanied by extensive controlled vocabularies for detailed description of molecular interactions. The current version, PSI-MI XML 2.5, also defines a simplified but still standardised tabular format for molecular interactions, the MITAB format. PSI-MI formats are now broadly accepted and widely implemented, for example by BioGRID, DIP, IntAct, MINT, and MPIDB, and are supported by key tools such as the Cytoscape visualisation and analysis system.
The first aim of the workshop was to provide an introduction to the PSI-MI XML and tabular formats, and to the broad range of open software surrounding them. This aim was implemented through a series of presentations on day 1.
The PSI formats significantly facilitate the integration of molecular interaction data from multiple sources, either by end users, or by dedicated composite databases such as APID, iRefIndex, or STRING. However, even with more consistent input data, keeping downstream resources up to date remains a significant challenge. In addition, composite databases are sometimes not allowed to redistribute data from specific resources due to restrictive licenses.
To access multiple interaction data resources instantaneously, software clients need a common interface for computational access that lets them interact with multiple sources in the same way. Faced with a similar challenge, the integration of genomic feature annotations from many different sources, Stein et al. developed the Distributed Annotation System (DAS), in which a usually web-based client integrates genomic feature annotations from potentially many servers, relative to a defined reference genome sequence. Currently, the central DAS registry lists hundreds of servers for multiple different data types.
The second major aim of the workshop was the adaptation of the distributed database access concepts pioneered by DAS to the domain of molecular interactions, based on the PSI-MI data format standards. Prior to the workshop, the HUPO PSI-MI work group had started to outline PSICQUIC, the PSI Common Query InterfaCe, a community standard for computational access to molecular interaction resources. The status of the draft specification was described by Bruno Aranda from the EBI. Days two and three of the workshop were dedicated to the joint finalisation of the PSICQUIC interface definition, followed by development of prototype implementations of the interface based on existing PSI-MI libraries and existing software provided by the different participants. Figure 1 shows an idealised view of the PSICQUIC principle.

Figure 1: Idealised view of the PSICQUIC concept: A sample is observed by multiple independent experiments, each yielding a partial view of the real interaction network under observation. Resulting data is captured independently by several interaction databases. Through use of the consistent PSICQUIC query interface, web-based client software can relatively easily access all relevant interaction data sources, ideally re-composing an almost complete view of the sample under consideration.
Assessment of the results & impact of the event
As described above, on the simplest level, PSICQUIC provides a set of methods to simultaneously query multiple molecular interaction databases. A typical query may request all the interactions involving a specific interactor (typically a protein), as defined by a database accession number, for example UniProtKB, or all interactions involving at least one of a set of interactors. PSICQUIC servers may return data in one or more output types, usually the tabular MITAB or the more detailed PSIMI XML format. The available output types can be queried, and the required output type set as a query parameter.
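To make the tabular output type concrete, the following Python sketch parses MITAB 2.5 text and selects the interactions involving one interactor. It assumes the 15 tab-separated columns of that format, with "-" marking an empty field and "|" separating multiple values. The column labels, the function names, the sample records, and the exampledb identifiers are invented for illustration and do not come from any PSICQUIC server.

```python
# Shorthand labels (our own naming) for the 15 tab-separated MITAB 2.5 columns.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publication_ids",
    "taxid_a", "taxid_b", "interaction_types", "source_databases",
    "interaction_ids", "confidence_scores",
]


def parse_mitab(text):
    """Parse MITAB 2.5 text into a list of dicts mapping column -> list of values.

    Simplification: values are split naively on '|', so a quoted value that
    itself contains '|' would be split incorrectly.
    """
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < len(MITAB25_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected at least "
                f"{len(MITAB25_COLUMNS)} columns, got {len(fields)}"
            )
        rows.append({
            name: [] if value == "-" else value.split("|")
            for name, value in zip(MITAB25_COLUMNS, fields)
        })
    return rows


def interactions_for(rows, interactor_id):
    """Return the rows in which interactor_id appears as interactor A or B."""
    return [
        row for row in rows
        if interactor_id in row["id_a"] or interactor_id in row["id_b"]
    ]


# Invented sample records, not real interactions.
SAMPLE = "\n".join([
    "\t".join(["exampledb:A1", "exampledb:B2", "-", "-", "-", "-",
               'psi-mi:"MI:0096"(pull down)',
               "-", "-", "-", "-", "-", "-", "-", "-"]),
    "\t".join(["exampledb:C3", "exampledb:D4"] + ["-"] * 13),
])

if __name__ == "__main__":
    for row in interactions_for(parse_mitab(SAMPLE), "exampledb:A1"):
        print(row["id_a"], row["id_b"], row["detection_methods"])
```

A client that needs the full experimental detail would request the PSI-MI XML output type instead; the tabular form is usually sufficient for listing binary interactions.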
Similarly, a server may use one or more different reference systems for interactors, for example UniProtKB and/or RefSeq. PSICQUIC offers methods to query available reference systems, and choose servers accordingly. To overcome the incompatibilities of different protein sequence reference systems, we have agreed to also offer iRefIndex identifiers for proteins (presented at the workshop by Sabry Razick), which depend only on the protein sequence.
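The value of a sequence-derived identifier can be illustrated with a short Python sketch. This is not the iRefIndex algorithm; it only assumes, for illustration, that a key is computed as a SHA-1 digest of the normalised amino acid sequence, so that entries from different reference systems describing the same sequence receive the same key. The accessions and sequences below are invented.

```python
import hashlib
from collections import defaultdict


def sequence_key(sequence):
    """Return a hex digest that depends only on the residues of the sequence.

    Whitespace and letter case are normalised first, so formatting
    differences between databases do not change the key.
    """
    normalised = "".join(sequence.split()).upper()
    if not normalised:
        raise ValueError("empty sequence")
    return hashlib.sha1(normalised.encode("ascii")).hexdigest()


def group_by_sequence(records):
    """Group (accession, sequence) pairs by their sequence-derived key."""
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence_key(sequence)].append(accession)
    return dict(groups)


if __name__ == "__main__":
    # Invented accessions from two hypothetical reference systems.
    records = [
        ("refsys1:X1", "MKTAYIAK"),
        ("refsys2:Y7", "mktay iak\n"),  # same residues, different formatting
        ("refsys1:X2", "MKTAYIAR"),
    ]
    for key, accessions in group_by_sequence(records).items():
        print(key[:12], accessions)
```

Because the key is computed from the sequence alone, a client can match interactors across servers without a mapping table between accession systems.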
PSICQUIC output data sets can potentially be very large, and might cause long response times, timeouts, and excessive memory usage on both server and client. Therefore, PSICQUIC offers methods for pagination, the retrieval of result sets in multiple chunks of manageable size. For the full set of PSICQUIC methods, please refer to http://code.google.com/p/psicquic/ .
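The pagination pattern can be sketched in Python as follows. The callable fetch_page(first_result, max_results) is hypothetical: it stands in for whatever server call a real client would make, and is assumed to return one chunk of results together with the total number of matches. The in-memory fake server exists only to make the sketch runnable.

```python
def iter_results(fetch_page, page_size=200):
    """Yield all results of a query, retrieving them in chunks of page_size.

    fetch_page(first_result, max_results) is a hypothetical callable that
    returns (rows, total_count) for the requested window.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    first = 0
    while True:
        rows, total = fetch_page(first, page_size)
        if not rows:
            break  # nothing left, or the server returned an empty chunk
        yield from rows
        first += len(rows)
        if first >= total:
            break  # every matching result has been delivered


def make_fake_server(count):
    """Build an in-memory stand-in for a server holding `count` results."""
    data = [f"interaction-{i}" for i in range(count)]
    calls = []

    def fetch_page(first_result, max_results):
        calls.append((first_result, max_results))
        return data[first_result:first_result + max_results], len(data)

    return fetch_page, calls


if __name__ == "__main__":
    fetch_page, calls = make_fake_server(5)
    print(list(iter_results(fetch_page, page_size=2)))
    print(calls)
```

Because the function is a generator, the client holds at most one chunk in memory at a time and can stop early without requesting the remaining chunks.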
The default PSICQUIC methods described above provide a basic, but highly functional interface covering many use cases. However, they do not allow complex queries. Therefore, we have defined MIQL, a flexible query language for use through the PSICQUIC interface. MIQL is based on standard Lucene syntax and offers single word or phrase queries (abl1 AND "pull down"), search in specific attributes/columns (abl1 AND species:human), wildcards (abl*), and logical operators. For a full description of MIQL and currently defined attribute/column names, please see the PSICQUIC project page referenced above.
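The query forms listed above can be composed programmatically. The Python sketch below builds query strings in that Lucene-style syntax and percent-encodes them for use in a URL. The helper names are our own, only the species attribute is taken from the text, and the sketch deliberately does not escape special characters or embedded quotes.

```python
from urllib.parse import quote


def term(value, field=None):
    """Format one search term; values containing whitespace become phrases."""
    text = f'"{value}"' if any(ch.isspace() for ch in value) else value
    return f"{field}:{text}" if field else text


def all_of(*terms):
    """Combine terms so that every one of them must match."""
    return " AND ".join(terms)


def any_of(*terms):
    """Combine terms so that at least one must match (parenthesised)."""
    return "(" + " OR ".join(terms) + ")"


if __name__ == "__main__":
    phrase_query = all_of(term("abl1"), term("pull down"))
    field_query = all_of(term("abl1"), term("human", field="species"))
    wildcard_query = term("abl*")
    for query in (phrase_query, field_query, wildcard_query):
        # quote() percent-encodes spaces and double quotes for use in a URL.
        print(query, "->", quote(query))
```

Keeping query construction separate from transport means the same strings can be passed to any service that accepts this syntax, whatever access method it offers.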
As of April 2010, the PSICQUIC interface has been implemented by 9 major interaction data sources. These comprise not only protein-protein interaction data sources, but also protein-small molecule interactions (ChEMBL) and simplified pathway data from Reactome. Figure 2 shows the current status of the PSICQUIC registry, listing PSICQUIC implementations and the number of interactions they provide, about 1.6 million in total as of April 2010. The broad adoption of PSICQUIC has clearly been supported by the workshop, which involved representatives from most of these resources at a very early stage of the definition of the interface.

Figure 2: PSICQUIC registry as of April 2010.
The PSICQUIC open source project also provides a PSICQUIC reference implementation, which allows rapid setup of a PSICQUIC server with limited effort. PSICQUIC clients get the list of available resources online from the PSICQUIC registry. Therefore, registration of a new server is all that is necessary to make the new resource accessible to all clients without further programming. Following the successful example of the DAS registry, PSICQUIC sources are annotated to allow classification of data sources according to data types such as protein-protein or drug-target interactions.
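The registry-driven design can be sketched in Python as follows. The registry entries, service names, example.org URLs, tag strings, and the run_query callable are all hypothetical placeholders; a real client would download the service list from the PSICQUIC registry and send the query to each server. The sketch only shows how clients can select sources by annotation and send the same query to each of them without source-specific code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One registry entry: a name, an access URL, and classification tags."""
    name: str
    url: str
    tags: frozenset


def select_services(registry, required_tag):
    """Return the registered services annotated with required_tag."""
    return [service for service in registry if required_tag in service.tags]


def query_all(services, query, run_query):
    """Send the same query to every service and collect results per source.

    run_query(service, query) is a hypothetical callable performing the
    actual request. A failing source is recorded and does not stop the rest.
    """
    results, failures = {}, {}
    for service in services:
        try:
            results[service.name] = run_query(service, query)
        except Exception as error:  # keep going if one source is unavailable
            failures[service.name] = str(error)
    return results, failures


# Hypothetical registry content, for illustration only.
REGISTRY = [
    Service("SourceA", "https://example.org/a", frozenset({"protein-protein"})),
    Service("SourceB", "https://example.org/b", frozenset({"drug-target"})),
    Service("SourceC", "https://example.org/c", frozenset({"protein-protein"})),
]

if __name__ == "__main__":
    def fake_run_query(service, query):
        if service.name == "SourceC":
            raise ConnectionError("server unavailable")
        return [f"{service.name}: match for {query}"]

    selected = select_services(REGISTRY, "protein-protein")
    print(query_all(selected, "abl1", fake_run_query))
```

Adding a fourth entry to the registry list makes it available to query_all without any change to the client code, which mirrors the benefit of central registration described above.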
An open source Java client library is provided at the same website. The simplicity of the PSICQUIC interface has already resulted in a number of client applications providing easily accessible data for the bench biologist. For example, a simple PSICQUIC client provides a tabulated overview of result data from multiple sources for a given user query (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml). Additionally, as of version 2.7.0, the popular Cytoscape visualisation tool (represented at the workshop by one of the core developers, Kei Ono) can import PSICQUIC data and enables a query to be made from within Cytoscape across all participating resources. The PSICQUIC interface can also be incorporated into existing data resources – the current version of the IntAct molecular interaction database not only provides the internal results for a given query, but also shows a direct link to the external search result (http://www.ebi.ac.uk/intact). Similarly, the current Reactome beta version provides manually laid out pathway diagrams, and for each protein in a diagram, known interactors from any given PSICQUIC source can be visualised in a graphical overlay (see Figure 3). This allows users not only to visualise additional proteins which may influence or be influenced by the described pathway, but also, by using the PSICQUIC interface of the ChEMBL drug-target database, to identify small molecules which are capable of perturbing this same pathway.

Figure 3: Reactome beta version, April 2010. Manually laid out pathway diagrams can be overlaid with molecular interaction data (in blue). Data is received via the PSICQUIC interface, which allows easy switching between different interaction data sources.
PSICQUIC now provides access to a broad range of molecular interaction data sources. Current clients retrieve the data from one or more of these sources, but do not yet integrate them. We have started to provide a clustering service for PSICQUIC data from multiple sources.
Programme
Sunday, November 16, 2008
13:00 Introduction – Henning Hermjakob, EBI, UK
13:20 The PSI MI XML 2.5 standard – Arnaud Ceol, U Tor Vergata, Rome, Italy
13:40 PSI tools: Java API, validator – Samuel Kerrien, EBI, UK
14:00 PSI tools: RpsiXML - Jitao David Zhang, DKFZ, Germany
14:20 The PSICQUIC common query interface – Bruno Aranda, EBI, UK
14:45 Coffee
15:00 Definition of development targets and developer teams
Hands-on software specification/development in small groups
17:00 Short status summary
19:00 Meeting dinner, Hinxton Hall
Monday, November 17, 2008
09:15 Cytoscape as a web services client - Keiichiro Ono, UCSD, USA
10:00 The PSISCORE molecular interaction confidence scoring framework – Hagen Blankenburg, MPI, Germany
10:20 The iRefIndex system – Sabry Razick
10:40 Definition of development targets and developer teams
Hands-on software specification/development in small groups
12:30 Lunch
13:30 Hands-on software development in small groups
15:00 Coffee
17:00 Short status summary
19:00 Meeting dinner, Red Lion, Hinxton
Tuesday, November 18, 2008
09:15 Definition of development targets and developer teams
Hands-on software development training in small groups
12:30 Lunch
13:30 Hands-on software development in small groups
15:00 Coffee
17:00 Short status summary
Wednesday, November 19, 2008
09:15 Definition of development targets and developer teams
Hands-on software development training in small groups
12:30 Lunch
13:30 Summary session: Results and further planning of development and dissemination (publications, updates of http://www.psidev.info )
15:00 Meeting end