Notes from Abstract
Proteomic mass spectrometry is a method that enables sequencing of gene product fragments, enabling the validation and reﬁnement of existing gene annotation as well as the detection of novel protein coding regions. However, the application of proteomics data to genome annotation is hindered by the lack of suitable tools and methods to achieve automatic data processing and genome mapping at high accuracy and throughput.
In the ﬁrst part of this project I evaluate the scoring schemes of “Mascot”, which is a peptide identiﬁcation software that is routinely used, for low and high mass accuracy data and show these to be not suﬃciently accurate. I develop an alternative scoring method that provides more sensitive peptide identiﬁcation speciﬁcally for high accuracy data, while allowing the user to ﬁx the false discovery rate. Building upon this, I utilise the machine learning algorithm “Percolator” to further extend my Mascot scoring scheme with a large set of orthogonal scoring features that assess the quality of a peptide-spectrum match.
To close the gap between high throughput peptide identiﬁcation and large scale genome annotation analysis I introduce a proteogenomics pipeline. A comprehensive database is the central element of this pipeline, enabling the eﬃcient mapping of known and predicted peptides to their genomic loci, each of which is associated with supplemental annotation information such as gene and transcript identiﬁers.
In the last part of my project the pipeline is applied to a large mouse MS dataset. I show the value and the level of coverage that can be achieved for validating genes and gene structures, while also highlighting the limitations of this technique. Moreover, I show where peptide identiﬁcations facilitated the correction of existing annotation, such as re-deﬁning the translated regions or splice boundaries.
Moreover, I propose a set of novel genes that are identiﬁed by the MS analysis pipeline with high conﬁdence, but largely lack transcriptional or conservational evidence.