|Standardisation: the most difficult flower to grow.|
The 2014 meeting of the PSI (Proteomics Standards Initiative) was held this year in Frankfurt (13-17 April), and I can say I’m now part of this history. First, I will try to describe in a couple of sentences (I will surely fail) the incredible venue, the Schloss Reinhartshausen Kempinski. When I first saw the hotel, the first thing that came to my mind was those films from the 50s. Everything was elegant, classic and sophisticated, from the decoration down to the smallest latch. The food was incredible, and the service was first class from the moment you set foot on the front step and throughout the whole stay.
Standardization is the process of developing and implementing technical standards. It can help to maximize compatibility, interoperability, safety, repeatability and quality, and it can also turn formerly custom processes into commodities. In bioinformatics, the standardization of file formats, vocabularies and resources is a job that all of us appreciate but that, for several reasons, nobody wants to do. First of all, standardization in bioinformatics means organizing and merging different experimental and in-silico pipelines so that the information can be represented in a common way. In proteomics, for example, you can combine different sample preparations with different fractionation techniques and different mass spectrometers, and then use different search engines and post-processing tools. This diversity of possible combinations is needed because it allows us to explore different solutions to complex problems (see Standardization in Proteomics: From raw data to metadata files).
|HUPO-PSI 2014 Venue: Schloss Reinhartshausen Kempinski.|
The Proteomics Standards Initiative formally started in 2002, so it has now been running for more than 12 years. Since the first manuscript published by the group (Meeting Review: The HUPO Proteomics Standards Initiative meeting: towards common standards for exchanging proteomics data), its members have addressed the major challenges that this topic poses for the community:
“There was a remarkable consensus between delegates attending the PSI meeting to the effect that valuable data would be lost without public repositories and common interchange formats making information accessible to the scientific community… All such efforts require support from the user community and from the scientific press and funding agencies.”
The HUPO-PSI consortium has been working in four major groups: (i) Molecular Interactions, (ii) Mass Spectrometry, (iii) Proteomics Informatics, (iv) Protein Separations. From my point of view, the major results obtained under the PSI umbrella are:
- Definition of guidelines and controlled vocabularies for reporting proteomics and molecular interaction data.
- Development of PSI standard file formats (mzML, mzIdentML, mzQuantML, qcML, mzTab, PSI-MI, MITAB).
- Implementation of different resources and tools for the standardization, visualization and sharing of proteomics data (PRIDE, IntAct, Reactome, PRIDE Inspector, PRIDE Converter, ProteoWizard, etc.).
Description of major outcomes and results
Guidelines and controlled vocabularies for reporting proteomics and molecular interaction data: the "minimum information about" series is a collection of manuscripts and guidelines that encourage the standardised collection, integration, storage and dissemination of proteomics data. Within it, the HUPO-PSI develops guidance modules for reporting the use of techniques such as gel electrophoresis, mass spectrometry and protein interaction networks.
The MIAPE (Taylor et al., 2007) and MIMIx guidelines are divided into various modules:
· Study design and sample: experimental motivation and design; factors of interest; origin and preprocessing of biological material; numbers of replicates; relationship to other studies; miscellaneous administrative detail.
· Separations and sample handling.
· Column chromatography.
· Capillary electrophoresis.
· Mass spectrometry.
· Informatics for mass spectrometry.
· Gel electrophoresis.
· Gel image informatics.
· Protein and peptide arrays.
· Statistical analysis of data.
· Molecular interaction experiments.
|Different paths and ideas, but only those well supported and structured are successful.|
The modules highlighted in black in the list above are, from my point of view, the most successful ones in terms of data standards, resources, tools and benefits provided to the proteomics community. These modules demonstrate the importance of having a good idea, a progressive field and a powerful community behind it. The mass spectrometry and informatics for mass spectrometry modules have been led by the PeptideAtlas and PRIDE groups, among others. These groups have built their pipelines, data and tools on the progress of the controlled vocabularies, standards and guidelines for data publication and dissemination. The molecular interaction module has been a cornerstone of the development of the IntAct database (http://www.ebi.ac.uk/intact/) and PSICQUIC (https://code.google.com/p/psicquic/). Some notes from the meeting and the current status of each module:
The mass spectrometry guidelines have guided the development of standards for MS/MS representation and, finally, the development of mzML. mzML is still under active development: advances in technology bring new challenges that the standard needs to meet, including the application of mzML to metabolomics, to SWATH-MS and other data-independent acquisition workflows, and to ion mobility MS. mzML is already suitable for metabolomics, and only the addition of new CV terms is required to meet the current needs of that community. To tackle the issue of data compression, the use of mgzip combined with a new compression method, MS-Numpress, will yield mzML files that are often smaller than vendor files.
|mzML: all about compression using MS-Numpress and mgzip.|
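To give a feeling for why a numeric pre-encoding step helps before general-purpose compression, here is a minimal Python sketch. It is not the MS-Numpress algorithm and not mgzip: it only assumes that an m/z array is stored as base64 text after zlib compression (as mzML binary arrays can be), and compares that with a simple lossy fixed-point delta encoding. The function names, the scale factor and the synthetic m/z values are all made up for the illustration.

```python
import base64
import struct
import zlib


def encode_plain(values):
    """Pack as little-endian 64-bit floats, zlib-compress, base64-encode."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(zlib.compress(raw))


def encode_delta_fixed_point(values, scale=10000):
    """Hypothetical lossy pre-encoding: round to fixed point, store differences.

    Neighbouring m/z values are close together, so the differences are small,
    repetitive integers that zlib compresses far better than raw doubles.
    """
    ints = [round(v * scale) for v in values]
    deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
    raw = struct.pack(f"<{len(deltas)}i", *deltas)
    return base64.b64encode(zlib.compress(raw))


def decode_delta_fixed_point(payload, scale=10000):
    """Invert encode_delta_fixed_point (up to the rounding error)."""
    raw = zlib.decompress(base64.b64decode(payload))
    deltas = struct.unpack(f"<{len(raw) // 4}i", raw)
    values, running = [], 0
    for delta in deltas:
        running += delta
        values.append(running / scale)
    return values


# Synthetic, evenly spaced m/z axis (an assumption; real spectra are less regular).
mz = [400.0 + 0.0137 * i for i in range(5000)]

plain = encode_plain(mz)
compact = encode_delta_fixed_point(mz)
print("plain zlib + base64:", len(plain), "bytes")
print("delta fixed point + zlib + base64:", len(compact), "bytes")
```

The price of the smaller payload is precision: with a scale of 10000 the decoded values are only guaranteed to be within 0.00005 of the originals, so the scale has to be chosen with the instrument accuracy in mind.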
mzML is a mature file format: it can represent chromatography and MS information in the same file, and with the new improvements the file size has decreased considerably compared with competitors such as mz5 and mzXML. The mass spectrometry community still faces some challenges as topics such as ion mobility and DIA (data-independent acquisition) evolve. Issues that this group still has to resolve include deciding how synthesized MS2 spectra acquired from MSE (a data-independent approach that acquires MS1 and MS2 mass spectra in an unbiased and parallel manner) should be captured in mzML, and how merged or clustered spectra should be captured.
The informatics for mass spectrometry module has been involved in the development and implementation of standards to represent the identification, quality assessment and post-processing of mass spectrometry data. Apart from the ontologies and the important work done in standardization, the main output of this group is the release of mzIdentML. mzIdentML was released in 2012 and is the successor of the pepXML and protXML file formats. The mzIdentML standard for peptide/protein identification also requires some updates to meet the needs of protein grouping, including statistical thresholds for protein groups, support for peptide-level statistics, support for the use of multiple search engines and the first support for chemical cross-linking studies.
|mzTab (Laurel) & mzIdentML (Hardy)|
Recently this group developed mzTab, a lightweight file format for peptide/protein identification and quantitation. mzTab has been used for proteomics but also for metabolomics. As a bioinformatician and user of mzTab, I really like its general way of modelling quantitation experiments. As a developer, I like the simple way it represents different kinds of complex data in one format; its data structure is really simple and easy to learn. Also, the size of one complete experiment should be about 1/20 of the corresponding mzIdentML file. Both file formats are now supported by PRIDE Inspector to check the quality of proteomics experiments.
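To show how simple the structure is, here is a minimal Python sketch that reads an mzTab-like text. It only relies on the fact that mzTab is tab-delimited and that every line starts with a short prefix naming its section (MTD for metadata, PRH/PRT for the protein header and rows, PSH/PSM for the PSM header and rows, and so on). The helper names and the tiny example content are hypothetical, and the sketch does no validation against the specification.

```python
from collections import defaultdict


def split_mztab_sections(text):
    """Group the tab-separated fields of each line by its leading prefix."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        prefix, *fields = line.split("\t")
        sections[prefix].append(fields)
    return sections


def rows_as_dicts(sections, header_prefix, row_prefix):
    """Pair each data row with the column names from its header line."""
    header = sections[header_prefix][0]
    return [dict(zip(header, row)) for row in sections[row_prefix]]


# Made-up content for illustration only; not taken from a real submission.
EXAMPLE = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\thypothetical example",
    "",
    "PRH\taccession\tdescription\tbest_search_engine_score[1]",
    "PRT\tP12345\texample protein A\t0.001",
    "PRT\tQ67890\texample protein B\t0.02",
])

sections = split_mztab_sections(EXAMPLE)
proteins = rows_as_dicts(sections, "PRH", "PRT")
for protein in proteins:
    print(protein["accession"], protein["best_search_engine_score[1]"])
```

The same two helpers work unchanged for the other sections, for example `rows_as_dicts(sections, "PSH", "PSM")` for a file that has a PSM table, which is a good part of why the format is so easy to pick up.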
The molecular interaction experiments module has been working on the development of molecular interaction resources, standards, controlled vocabularies, etc. The main work of the molecular interaction group at this meeting concerned the PSI-MI XML standard, with the aim of enabling:
· The ability to exchange “abstracted” data, i.e. knowledge built from experimental data such as protein complex composition/topology and cooperative interactions.
· The ability to add information on dynamic interactions.
· The capture of the causality of molecular interactions, a requirement that needs further discussion with external groups to ensure we have adequate data capture in an appropriate format.
The group then discussed the development of the JAMI Java application programming interfaces. JAMI is a single Java library designed to unify the MITAB and PSI-XML standard formats by providing a common Java framework while hiding the complexity of both from the naive user, similar to other proteomics libraries such as ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api). Once the first version of the JAMI core data model has been released, subsequent tool development will be easier, because tools will no longer need to be format specific.
On the last day, a complete session was dedicated to ProteomeXchange: major results and future challenges. Some of the partners involved in ProteomeXchange gave talks about their resources and tools. The advances and future developments in resources such as PRIDE, PeptideAtlas and MassIVE were presented and discussed in detail. These three resources are the main partners of the consortium at the moment. Robert Chalkley also gave a really nice introduction to MS-viewer, a web-based spectral viewer for proteomics results.
These are some of my quick notes, together with some material from "Meeting New Challenges: The 2014 HUPO‐PSI/COSMOS Workshop." HUPO-PSI is history and I was part of it. Future challenges have emerged from the discussions and new ideas. I met really nice people during the meeting, people who make our lives easier in the lab.
 Taylor CF, Paton NW, Lilley KS et al. The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol. 2007 Aug;25(8):887-93.