Wednesday, 25 April 2012

Perl Proteomics & InSilicoSpectro

In contrast with genomics, bioinformaticians in proteomics don’t have a "big" and "complete" perl library for proteomics data analysis. It could be related with the "heterogeneity" in proteomics. A lot of different instruments, protocols, properties. Also genomic have a huge community (bioinformaticians) and standardize tools (instruments and software’s). In 2006 Collinge and Masselot published an open-source perl library named InSilicoSpectro. The aim was provide a set of recurrent functions that are necessary for proteomics data analysis.

Some of the Illustrative functions are: mz list file format conversions, protein sequence digestion, theoretical peptide and fragment mass computations, graphical display, matching with experimental data, isoelectric point estimation (with different methods), and peptide retention time prediction.

At the end of the manuscript abstract the authors says: "We believe that InSilicoSpectro will be of great help to bioinformaticians, without detailed knowledge of proteomics specifics, and to mass spectrometrists with computer programming interest as well"

But what we can do with InsilicoSpectro & Bioperl:

Reading a Fasta File and make a Tryptic Digestion:



use Bio::SeqIO;

use Bio::Seq;
use IO::String;

use InSilicoSpectro;

use InSilicoSpectro::InSilico::MassCalculator;
use InSilicoSpectro::InSilico::CleavEnzyme;
use InSilicoSpectro::InSilico::AASequence;
use InSilicoSpectro::InSilico::Peptide;
use InSilicoSpectro::InSilico::IsoelPoint;
use InSilicoSpectro::InSilico::ExpCalibrator;
undef $InSilicoSpectro::InSilico::MassCalculator::invalidElementCall;

$inFile = Bio::SeqIO->new(-file => "$ARGV[0]", -format => 'fasta');

my $enzyme = $ARGV[1];      // Enzyme
my $miss      = $ARGV[2];      // Miss Cleavage Sites
my $name_out = $ARGV[3];  // out put file

InSilicoSpectro::init("insilicodef.xml"); // file of InsilicoSpectro Definitions

open (OUTDATA, ">$name_out") or  die("Error: cannot open file $name_out\n");

while (my $Protein = $inFile->next_seq()){

  $id = $Protein->display_id();             # Id. Protein.
  $seq = $Protein->seq();                   # String of sequence
  $description = $Protein->description();   # Description of the sequence 

  my $protein = new InSilicoSpectro::InSilico::Peptide(sequence=>"$seq",modif=>"");
  my $proteinSequence  = new InSilicoSpectro::InSilico::AASequence(sequence=>$seq, AC=>$id);
  $mass_value = $protein->getMass();
  $protein_mass_value = sprintf("%.6f", $mass_value);   // Get the protein mass
   @result = digestByRegExp(protein=>$ProteinSequence,minMass=>"0",nmc=>$misscleave, 
  foreach $p (@result){
    $peptide = $p->sequence;
    print (OUTDATA ">$id\t$description\t$protein_mass_value\n");
    print (OUTDATA "$peptide\n");
}close OUTDATA  or die("Error: cannot close file $name_out\n");  # wait for close output file.


Whit a simple script the user can obtain very good results without efforts. One of the key feature is the use of this library for protein database analysis. Proteomic Identifications are mainly based on Database Search. Is a common practice when you are writing your manuscript put some "estimations" or predictions about the possible behavior of the experiments. Most of the time you need to use database knowledge and a statistical background of the database, even before the experiment design...   

An example:

Distribution of mass for Unique Peptides for different peptide tolerance error in ppm:

The process could be divided in three steps. Each of then could be computed using InsilicoSpectro. We can reuse the first script to digest the protein sequences and put some code inside to filter by peptide mass and retrieve the peptide with the mass annotated:

 foreach $p (@result){
    $peptide = $p->sequence;
    my $peptideInsilico = new InSilicoSpectro::InSilico::Peptide(sequence=>"$peptide",modif=>"");
    $mass_value = $peptideInsilico->getMass();
    $peptide_mass = sprintf("%.6f", $mass_value);   // Get the protein mass 
   if (($peptide_mass >= $ARGV[4]) && ($peptide_mass <= $ARGV[5])){  
       print (OUTDATA "$peptide\t$peptide_mass\n");

This small change filter all peptide between 800-3500 (QTOF resolution) and compute the mass of each peptide. After this small change, an R script or a common perl algorithm can help to retrieve the histogram by peptide mass.

Another key feature of InsilicoSpectro is the estimation of different peptide/protein properties like isoelectric point, retention time.. Also function for Mascot .dat peptide/spectrum match extraction and processing.

the user can predict easily the isoelectric point with algorithm developed by David Tabb and different datatasets:

Amino acid
EMBOSS 8.6 3.6 8.5 3.9 4.1 6.5 10.8 12.5 10.1
DTASelect 8.0 3.1 8.5 4.4 4.4 6.5 10.0 12.0 10.0
Solomon 9.6 2.4 8.3 3.9 4.3 6.0 10.5 12.5 10.1
Sillero 8.2 3.2 9.0 4.0 4.5 6.4 10.4 12.0 10.0
Rodwell 8.0 3.1 8.33 3.68 4.25 6.0 11.5 11.5 10.07
Patrickios 11.2 4.2 - 4.2 4.2 - 11.2 11.2 -
Lehninger 9.69 2.34 8.33 3.86 4.25 6.0 10.5 12.4 10.0

Retention time could be compute with Krokin and Petritis methods with simple functions.

An example of isoelectric point calculation is:

 $pi = InSilicoSpectro::InSilico::IsoelPoint->new(method=>"iterative",current=>"Lehninger",%settings);
 $piPeptide = $pi->predict(peptide => uc $peptide);

An important module of the library is related with the spectrum processing. Even process is considered as "more specialized", some researcher with less bioinformatician background uses these functions to evaluate the spectrum quality...

InsilicoSpectro give an excellent opportunity to process your database and your identification data...

Take a look!!!