Thursday, 13 August 2015

The future of Proteomics: The Consensus


After the big Nature papers about the Human Proteome [1][2], the proteomics community has been divided over the same well-known topics that divided genomics before: same reasons, same discussions [3-7]. No one is arguing about the technical issues, the instrument settings, the sample processing, or even the analytical method (most of both projects are "common" bottom-up experiments). The main issues are data-analysis problems, which remain open Computational Proteomics challenges.


The first analysis of the data was performed by Ezkurdia et al. [3], who tested the two drafts against a clear biological expectation:

We decided to carry out a simple quality test on the data using the olfactory receptor family. Olfactory receptors are seven trans-membrane helix receptors that trigger the olfactory signal transduction pathway. These receptors first appeared in vertebrates and have duplicated to such an extent that mammalian species possess many hundreds of these genes. .. The functional specificity of these genes indicates that expression is predominantly limited to a single tissue, although the mouse orthologue of OR51E2 has been convincingly shown to have a function in the kidney and the Human Protein Atlas records limited RNA evidence for the expression of olfactory receptors outside of the nose. Olfactory receptors have very little transcript expression and should be particularly difficult to detect in proteomics experiments because they are trans-membrane proteins. 
A high quality proteomics experiment that does not include a specific analysis of nasal tissues should not expect to detect much evidence of peptide expression for these genes. For example, PeptideAtlas [7], known for having high stringency criteria, identifies just 2 discriminating olfactory receptor peptides. As far as we know neither of the studies carried out experiments on nasal tissues. We found peptide evidence for 108 of these olfactory receptors in the Human Proteome Map database and another 200 olfactory receptors are recorded in ProteomicsDB. 
There are at least three reasons for the high numbers of olfactory receptors in the two studies. First neither experiment properly distinguishes between discriminating and non-discriminating peptides, so olfactory receptors are identified by peptides that map to more than one gene (40 of the olfactory receptors detected in the Pandey study were identified solely by non-discriminatory peptides).  Second, a number of peptides were wrongly identified as having a glutamine to pyroglutamic acid modification in non N-terminal positions. Third, both studies include very many low quality spectra.
The approach is as simple as it sounds: they took a small set of proteins as a hypothesis-driven quality control (I will come back to this topic in my thoughts) and used it to find wrong identifications; by manually inspecting the data, they found poor-quality spectra identified with poor scores.
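The first point in the quote, discriminating versus non-discriminating peptides, can be sketched in a few lines. This is only an illustration: the peptide sequences and gene labels are invented, and the function name `classify_genes` is hypothetical. A real analysis would build the peptide-to-gene map from an in-silico digest of the searched database.

```python
def classify_genes(peptide_to_genes):
    """Split identified genes by the kind of peptide evidence they have.

    A peptide is discriminating when it maps to exactly one gene.
    Returns (genes with at least one discriminating peptide,
             genes seen only through shared peptides).
    """
    supported = set()
    seen = set()
    for peptide, genes in peptide_to_genes.items():
        seen.update(genes)
        if len(genes) == 1:
            supported.update(genes)
    return supported, seen - supported


# Hypothetical identifications (sequences and gene labels are made up).
peptide_to_genes = {
    "LSSVAPLR": {"OR_A"},
    "TFSTCASHLK": {"OR_A", "OR_B"},
    "MYFFLSNLSFK": {"OR_B", "OR_C"},
}

supported, ambiguous = classify_genes(peptide_to_genes)
print(sorted(supported))  # ['OR_A']
print(sorted(ambiguous))  # ['OR_B', 'OR_C']
```

Genes in the second set are the ones that should not be claimed as detected without further evidence.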

After that, Omenn et al. [5] described and compared the results in detail:
There are some major surprises in the Kim et al study. First, quite lax filters were employed to control false-discovery, 1% for 25 million PSM, 1% for 293,000 peptides, and no FDR filter at all for proteins. A match with either search engine was considered sufficient. Second, a minimal peptide length of 6 amino acids was accepted. Many protein matches were based on a single peptide. Third, no comparative analysis using standard HPP metrics was employed, nor were unlikely identifications scrutinized closely for alternative explanations of the spectral and peptide matches.  
In Wilhelm paper, they used 1% FDR for 1.1 billion PSM, 5% FDR for peptides (minimal length 7 amino acids), and no FDR for proteins, though they did deploy an early version of their picked target-decoy approach comparing pairs of observed and decoy sequences, now published. They chose to protect against true-positives being removed... Andromeda and MASCOT were used as search engines; a “hit” with either was considered sufficient to claim a protein match.
Finally, the Kuster lab published a reassessment of their own Wilhelm et al analysis. First they claimed that the classical target-decoy method for protein FDR eliminates a very high percentage of true positives in large heterogeneous datasets; however, no such phenomenon has been seen in PeptideAtlas or in GPMDB. More remarkably, their re-analysis of their own data in ProteomicsDB yielded only 14,714 (instead of 18,097 reported by Wilhelm et al), only modestly more than the classic FDR method (even without Mayu adjustment) at 14,035. When they compared with methods for the combined datasets (Kim et al, Wilhelm et al), the results were 15,375 versus 14,638.
The reanalysis of both datasets by two well-established pipelines (GPMDB/PeptideAtlas) shows less evidence (fewer proteins and peptides), highlighting one of the common bioinformatics problems: different pipelines -> different results (in some cases completely different).
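For readers less familiar with the FDR filters mentioned in these quotes, here is a minimal sketch of the classic target-decoy estimate at the PSM level. It assumes a search against concatenated target and decoy sequences, scores where higher is better, and the simple estimator decoys / targets. It is not the picked target-decoy approach nor the Mayu adjustment discussed above, and the function name and the toy scores are made up.

```python
def target_decoy_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: iterable of (score, is_decoy) pairs; a higher score is a better match.
    The FDR among PSMs scoring at or above a cutoff is estimated as
    (number of decoys) / (number of targets).
    Returns None if no cutoff satisfies the requested FDR.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Toy PSM list: (score, is_decoy). The values are invented.
psms = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
print(target_decoy_threshold(psms, max_fdr=0.25))  # 6
print(target_decoy_threshold(psms, max_fdr=0.0))   # 8
```

The same counting can be repeated on peptides or proteins instead of PSMs, which is exactly the step the quotes say was skipped at the protein level.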


The latest paper in the discussion (though I'm sure it will not be the last) is by Serang and Käll [7]. More provocative and proactive for the Computational Proteomics community, and with an exciting title, "The solution to statistical challenges in proteomics is more statistics, not less", the paper explains in detail the problem of not considering the FDR at the protein level in such studies, something that has been discussed extensively in the community and remains a challenge. The authors explain (in plain terms) how the error propagates through the different levels in proteomics: PSM -> Peptide -> Protein -> Quantitation.
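To get a feeling for why a 1% PSM-level FDR does not mean a 1% protein-level FDR, here is a toy calculation. It is my own simplification, not the model from the paper: every false PSM is assumed to land on a database protein uniformly at random, every protein that is really present is assumed to be identified, and a single PSM is enough to report a protein. All numbers and function names are hypothetical.

```python
def expected_false_proteins(n_psms, psm_fdr, n_absent, n_database):
    """Expected number of absent proteins hit by at least one false PSM.

    Toy assumption: each false PSM picks one of the n_database proteins
    uniformly at random, independently of the others.
    """
    n_false_psms = n_psms * psm_fdr
    p_hit = 1.0 - (1.0 - 1.0 / n_database) ** n_false_psms
    return n_absent * p_hit


def protein_fdr(n_psms, psm_fdr, n_present, n_database):
    """Protein-level FDR when one PSM is enough to report a protein."""
    false_proteins = expected_false_proteins(
        n_psms, psm_fdr, n_database - n_present, n_database
    )
    return false_proteins / (n_present + false_proteins)


# Hypothetical setting: 20,000 proteins in the database, 5,000 really present.
for n_psms in (100_000, 1_000_000, 10_000_000):
    print(n_psms, round(protein_fdr(n_psms, 0.01, 5_000, 20_000), 3))
```

In this toy setting the true proteins saturate while the false ones keep accumulating, so the protein-level error grows with the size of the dataset even though the PSM-level FDR stays fixed at 1%.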


Finally (and before my thoughts about this story), Alexey Nesvizhskii offers, in an excellent review [4], a few suggestions on how to "claim" new findings in proteomics; they are aimed at proteogenomics experiments, but they can also be applied to big datasets:

  • Protein databases should be made available through existing mechanisms (ProteomeXchange allows FASTA files).
  • Even if you use a custom database for your peptides, the "new findings" need to be mapped to common databases such as UniProtKB, RefSeq or Ensembl.
  • FDR for "novel" peptides needs to be described and applied differently than for known peptides.
  • The same peptide should not be used as evidence for multiple different proteins or protein forms. 
   
My thoughts:

These papers are incredible achievements for the proteomics community for several reasons (the number of tissues analysed, the amount of data produced, etc.), but they also exposed most of our major problems:
  • First, the race for the Human Proteome, the Human Genome, the Human Interactome, etc. only produces (in my opinion) an unproductive, inaccurate and fancy way of presenting results, which ends up being accepted by two reviewers who have no tools for screening the data in real time and no proper way to analyse the results.
  • Any exploratory analysis in proteomics (especially proteogenomics) should be followed up, or at least corroborated, with other approaches (hypothesis-driven MRM, immunochemistry, etc.), especially for the "new findings".
  • We need to keep in mind that PSM and peptide identification are the foundation of the building. On top of them sits all the biological knowledge: quantitation values, protein interaction analysis, pathway analysis, etc. Only consensus scores between different search engines, together with FDRs at the PSM, peptide and protein levels, will produce high-confidence identifications and reliable results at all levels. Otherwise, most identification errors are propagated to knowledge databases such as interaction, pathway, structure, or sequence databases, generating an endless, compounding error loop.
Only the consensus of the community will be able to provide reliable results. The future of this community, and of big papers like these, should move in the following direction:
  • Provide the data in archives (RAW, peak files, result files, FASTA).
  • Expose the claimed results in databases through APIs or files.
  • All pipelines need to be published, and properly tested and compared by the community, before any claims are made from the data. In this sense the TPP and the GPM are good examples.
  • The analysed results need to be the consensus among different pipelines, software tools and databases.
  • Run these big analyses inside consortia and collaborative environments, such as the recently published project "Proteogenomic characterization of human colon and rectal cancer" (CPTAC), a project with more than 75 authors, including collaborators, and more than 13 departments.
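
As a very small illustration of what "consensus among different pipelines" could mean in practice, the sketch below keeps only the identifications reported by a minimum number of pipelines. The pipeline names and accessions are placeholders, and this is a plain set operation rather than a statistical model: it does not by itself give an FDR for the combined list.

```python
from collections import Counter


def consensus(results, min_support=None):
    """Return identifications reported by at least min_support pipelines.

    results: dict mapping a pipeline name to the identifiers it reports.
    min_support defaults to the number of pipelines (strict consensus).
    """
    if min_support is None:
        min_support = len(results)
    counts = Counter(acc for ids in results.values() for acc in set(ids))
    return {acc for acc, n in counts.items() if n >= min_support}


# Placeholder pipelines and accessions, for illustration only.
results = {
    "pipeline_a": {"P1", "P2", "P3"},
    "pipeline_b": {"P1", "P2", "P4"},
    "pipeline_c": {"P1", "P3", "P4"},
}

print(sorted(consensus(results)))                 # ['P1']
print(sorted(consensus(results, min_support=2)))  # ['P1', 'P2', 'P3', 'P4']
```

The gap between the two outputs is the kind of disagreement that should be looked at before any claim is made.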

References

[1] Kim, Min-Sik, et al. "A draft map of the human proteome." Nature 509.7502 (2014): 575-581.
[2] Wilhelm, Mathias, et al. "Mass-spectrometry-based draft of the human proteome." Nature 509.7502 (2014): 582-587.
[3] Ezkurdia, Iakes, et al. "Analyzing the first drafts of the human proteome." Journal of proteome research 13.8 (2014): 3854-3855.
[4] Nesvizhskii, Alexey I. "Proteogenomics: concepts, applications and computational strategies." Nature methods 11.11 (2014): 1114-1125.
[5] Omenn, Gilbert S., et al. "Metrics for the Human Proteome Project 2015: Progress on the Human Proteome and Guidelines for High-Confidence Protein Identification." Journal of proteome research (2015).
[6] Deutsch, Eric W., et al. "The State of the Human Proteome in 2014/2015 as viewed through PeptideAtlas: enhancing accuracy and coverage through the AtlasProphet." Journal of proteome research (2015).
[7] Serang, Oliver, and Lukas Käll. "The solution to statistical challenges in proteomics is more statistics, not less." Journal of proteome research (2015).