Thursday, 9 June 2016

How to estimate and compute the isoelectric point of peptides and proteins?

By +Yasset Perez-Riverol and +Enrique Audain :

Isoelectric point (pI) can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge is equal to zero. Currently, there are available different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Peptide/Protein fractionation according to their pI is widely used in current proteomics sample preparation procedures previous to the LC-MS/MS analysis. The experimental pI records generated by pI-based fractionation procedures are a valuable information to validate the confidence of the identifications, to remove false positive and and could be used to re-compute peptide/protein posterior error probabilities in MS-based proteomics experiments. 

Theses approaches require an accurate  theoretical prediction of pI. Even thought there are several tools/methods to predict the isoelectric point, it remains hard to define beforehand what methods perform well on a specific dataset.  

We believe that the best way to compute the isoelectric point (pI) is to have a complete package with most of the algorithms and methods in the state of the art that can do the job for you [2]. We recently developed an R package (pIR) to compute isoelectric point using long-standing and novels pI methods that can be grouped in three main categories : a) iterative, b) Bjellvist-based methods and c) machine learning methods. In addition, pIR also offers a statistical and graphical framework to evaluate the performance of each method and its capability to “detect” outliers (those peptides/protein with theoretical pI biased from experimental value) in a high-throughput environment.

First lets install the package:

First, we need to install devtools:
install.packages("devtools")
library(devtools)
Then we just call:
install_github("ypriverol/pIR")
library(pIR)


How to compute peptide/protein isoelectric point using the SVM method using pIR?

Currently, pIR contains several support vector machines (SVM-based) pre-trained models that can be used to predict the isoelectric point from any amino acid sequence [1]. This models has been trained using different dataset at different pH range. Thus,  the accuracy in the prediction will be strongly depend of the chosen method.  To choose a specific model, the option available are the following:
  •         default (3.0-10.0 pH, ~7000 features) (1)
  •         heller (4.0-6.0 pH, ~6000 features) (2)
  •      branca (3.15-9.15 pH, ~69000 features) (3)

Additional details about the dataset used to build each svm-model can be seen in the references.

To compute the pI of a single sequence, use the command bellow:

----------------------------------------------------------------------------------------

> pI <- pISVMpeptide(sequence = "AGAAPYVQAFDSLLAGPVAE", model="default")
> #the result will be...
> pI
[1] 4.015568[1] 

----------------------------------------------------------------------------------------

> pI <- pISVMpeptide (sequence = "AGAAPYVQAFDSLLAGPVAE", model="heller")
> #the result will be...
> pI[2] 
[1] 4.178931

----------------------------------------------------------------------------------------
> # using the SVM model branca
> pI <- pISVMpeptide (sequence = "AGAAPYVQAFDSLLAGPVAE", model="branca")
> #The result will be...
> pI
[1] 4.049737
----------------------------------------------------------------------------------------

Most of the tools allow to compute the isoelectric point (example above) to individual sequences. However, to predict the isoelectric point of a list of peptides/proteins (for example, a short list of sequences contained in a file) the following option could be used:

----------------------------------------------------------------------------------------
#reading sequences from any file
>peptides <- read.table (file=pepFile, header = TRUE, sep = "\t")

#rename column name
>colnames (peptides) <-c("sequence")
 
#computing pI using SVM method with default option.
> peptides <- mdply (peptides, function (sequence) {pISVMpeptide(sequence = sequence, model="default")})

>colnames (peptides) <-c ("sequence", “pIsvm”)
----------------------------------------------------------------------------------------

Even though pIR supports the isoelectric point prediction using different pre-builded SVM models, it is possible to train a new model from a more complex dataset (for example, up to 100000 sequences). It must contains the experimental pI values associated to each sequence. This could be a time-consuming process depending of the size of dataset. The results in the prediction will be strongly related with the quality of the dataset supplied. Bellow, the steps to perform this analysis:

----------------------------------------------------------------------------------------
#reading sequences from any file
peptides <- read.table(file="data/svmDataDefault.csv", header = TRUE, sep = ",")

#rename column name
colnames(peptides) <-c("sequence", "pIExp")

#building dataset with features
data <- svmPIBuildData(originalData = peptides)

#Optionality, to compute pI using SVM method with default option.
data <- pISVMsequences(df=data, model = "default", newModel = FALSE)
 
#Build a new model from the current dataset by choose newModel flag = TRUE, and predict new values.
data <- pISVMsequences(df=data, newModel = TRUE)
----------------------------------------------------------------------------------------

An Iterative algorithm to predict the isoelectric point value works following three major steps. Firstly, it count the numbers of copies of the amino acids which play a role in determining pI and to propose pH values. Second, it determines the expected charge on the polypeptide for a particular pH value. And finally, it determines the expected proportion of charged and uncharged side chains for a particular amino acid, which are assessed from the supplied pK values (ref). There are several pK sets defined that could be used according to experimental setting evaluated. PIR supports the isoelectric point prediction based on Iterative method by choose the current pK set available from literature. Bellow some examples:

-----------library(pIR)

seq <- "GLPRKILCAIAKKKGKCKGPLKLVCKC"
pI <- pIIterative(sequence = seq, pkSetMethod = "solomon")
print(pI)
#The result will be 10.526

#Computing all pI values using Iterative method.
> sequence <- "AADCEVEQWDSDEPIPAK"
> pIvalues <- computeAllIterativeValues(seq = sequence)
#The result will be...
> pIvalues
           method values
1         solomon 3.4161
2         rodwell 3.3749
3          emboss 3.5322
4       lehninger 3.3711
5        grimsley 3.3012
6      patrickios 3.4220
7       DtaSelect 3.7848
8        toseland 3.3571
9       thurlkill 3.4784
10 nozaki_tanford 3.6445
-------------------------------

Scaling-up the isoelectric point prediction

The first idea with pIR was to benchmark the state-of-art isoelectric point prediction algorithms and compare its performance. Let’s take the Branca’s dataset like example to perform a comparative analysis. Since this dataset was collected using a high resolution isoelectrofocusing technology, it puts a high quality data out to do this kind of analysis. 

First, use to computePIValues function to predict all isoelectric point variable. This function take a data.frame in the way of: sequence pIExp, and return a new data.frame with all pI values computed using all methods available in pIR. This operation could be time-consuming depending of the size of the dataset.


#load dataset
> ruiBrancaDataSet <- read.csv("~/ruiBrancaDataSet.csv")
#compute pI values
> branca <- computePIvalues(ruiBrancaDataSet)
#view output
> head(branca, n=10)

           sequence experimental calibrated expasy skoog solomon rodwell emboss lehninger grimsley patrickios DtaSelect toseland  [...truncated]
       AADCEVEQWDSDEPIPAK        3.615      3.712  3.712 3.359   3.416   3.375  3.532     3.371    3.301      3.422     3.785    3.357
      AADPPAENSSAPEAEQGGAE        3.615      3.703  3.446 2.776   2.896   3.118  3.257     2.844    3.120      3.501     3.314    3.144
          AADSICDGDLVDSQIR        3.615      3.770  3.770 3.414   3.467   3.383  3.662     3.426    3.283      3.598     3.987    3.338
         AADTDGDGQVNYEEFVR        3.615      3.769  3.769 3.418   3.474   3.421  3.606     3.430    3.343      3.501     3.872    3.397
     AAEEAFVNDIDESSPGTEWER        3.615      3.714  3.714 3.364   3.423   3.409  3.495     3.376    3.358      3.723     3.714    3.413
       ADAEDLLDSFLSNILQDCR        3.615      3.710  3.710 3.353   3.409   3.340  3.575     3.366    3.246      3.501     3.872    3.303
             ADAEGESDLENSR        3.615      3.834  3.834 3.492   3.548   3.515  3.638     3.502    3.457      3.899     3.872    3.510
 ADDPSSYMEVVQAANTSGNWEELVK        3.615      3.834  3.834 3.492   3.548   3.515  3.638     3.502    3.457      3.501     3.872    3.510
               ADEEPTPADGR        3.615      3.917  3.917 3.578   3.632   3.585  3.738     3.588    3.523      4.024     3.987    3.574


Visualizing results with pIR

Correlation

> rawCorr <- plotRawCorrelation(branca)
> png("peptideRawCorrelation.png", width = 800, height = 800)
> multiplot(plotlist = rawCorr, cols=3)
> dev.off()

 
Figure 1. Correlation between theoretical and experimental isoelectric points using 14 different pI method on Branca dataset.

Histogram

> histogram <- plotHistFunc(branca)
> png("Distributions.png", width = 800, height = 800)
> multiplot(plotlist = histogram, cols=3)
> dev.off()


Figure 2. Distribution of isoelectric point for different methods and the experimental distribution. The y-axis is the number of peptides for the corresponding isoelectric point value (x-axis).



Plotting outliers dispersion

> outliers_dis <- plotOutlierOverall(branca)
> png("OutlierDispersion.png", width = 800, height = 800)
> multiplot(plotlist = outliers_dis, cols=3)
> dev.off()


Figure 3. Dispersion graphs of outliers (green) and non-outliers (red) peptides on the Branca dataset using different isoelectric point prediction algorithms. Outlier detection criteria: Abs (experimental – predicted) >= 2*RMSE.


Outliers distribution

> outliers <- plotOutlierDistribution(branca)
> png("OutlierDistribution.png", width = 800, height = 800)
> multiplot(plotlist = outliers, cols=3)
> dev.off()

Figure 4. Distribution of outliers and non-outliers populations of the Branca dataset. An outliers is defined if: Abs (pIexperimental – pItheoretical) ≥ 2*RMSE.


Visualizing fractions



> fractions <- plotFractionCorrelation(branca)
> png("FractionCorrelation.png", width = 800, height = 800)
> multiplot(plotlist = fractions, cols=3)
> dev.off()

Figure 5. Experimental vs. theoretical isoelectric point for different peptide fractions of an in-gel electrophoresis experiment (Branca dataset).


References



3. Branca, R.M., et al. HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics. Nature methods 2014;11(1):59-62.