Wednesday, 6 March 2013

#HavanaBioinfo2012 Hard, but Awesome Experience


From the 8th to the 11th of last December, the "I Bioinformatics for Biotechnology Applications" workshop (#HavanaBioinfo2012) was held at the hotel “Occidental Miramar”, located in an elegant area of Havana, Cuba. Putting on a bioinformatics workshop takes a lot of different pieces (small ones and big ones). You need to write a lot of emails to the invited speakers, and you need a nice and comfortable place with lots of tables and chairs, good food and drinks.

I started in October, writing the first emails to friends, advisers and professors at EBI (www.ebi.ac.uk), and to be completely honest it was fantastic, because most of them accepted from the very beginning. Thanks to Henning Hermjakob and Alex Bateman (@Alexbateman1) for the support. In the end the workshop was fully subscribed, with more than 45 attendees and sixteen speakers, two poster sessions and a panel discussion. Speakers from EBI, Ghent University (Belgium), Matrix Science (Mascot, UK) and Bioinformatics Solutions (Canada) accepted the invitation to come here and give one or two lectures about proteomics, genomics and bioinformatics.

Three of the most commonly used MS/MS search engines were represented at the conference: Mascot (David Creasy, Matrix Science), MaxQuant-Andromeda (Jürgen Cox, Mann Lab) and PEAKS (Paul Shan, Bioinformatics Solutions).

The workshop was organized in three sessions dedicated to “System biology resources”, “Protein identification and quantitation” and “Molecular drug design”. On the first day, I gave a brief introduction and welcomed the invited speakers and the students, and Dr. Gerardo Guillen (research director at CIGB) described the history and current developments of Cuban biotechnology. All invited speakers were surprised by the impact of the Cuban products on the health care system and by the current product pipeline of CIGB, particularly the Heberprot-P results.

Henning Hermjakob (EBI) explained the EBI resources, from Molecular Interactions (IntAct) via curated human pathways (Reactome) to Systems Biology Models (BioModels). In particular, the Proteomics Standards Initiative (PSI) Common Query Interface (PSICQUIC) motivated an interesting discussion about remote access to resources. To close the systems biology topic, Henning Hermjakob and Marco Punta described the UniProt and Pfam resources in detail.

Baozhen Paul Shan (Bioinformatics Solutions Inc.) described the history, theory, and practice of the de novo identification strategy. He demonstrated the actual scoring algorithm in PEAKS, and explained the fundamentals without losing the non-mathematicians in the audience. Continuing the de novo theme, Felipe Leprevost (Fiocruz, Brazil) explained the PepExplorer application, an integrated system to organize and statistically filter de novo sequencing results. Integrating the database search strategy and the de novo algorithm PepNovo in one workflow increases the number of peptides and proteins identified.

In the afternoon, Lennart Martens (Ghent University and VIB) talked about the “CompOmics toolsuite”. During the last twelve years Lennart's group has developed a broad set of Java tools for proteomics data analysis. The source code, documentation and a complete set of examples for the main code library are freely available at http://compomics-utilities.googlecode.com. Closing the first day, Klemens Vierlinger (Health & Environment Department/AIT/Vienna, Austria) described the current challenges in meta-analysis and data integration in biomarker discovery, especially in human fibrotic disease. After the afternoon coffee break, the poster session included the discussion of eleven posters by students from Cuba, Mexico and Colombia.

The second day was dedicated entirely to protein identification strategies and tools. The possibility to interact and discuss with David Creasy (Matrix Science, Mascot search engine) and Jürgen Cox (Max Planck Institute, Martinsried, MaxQuant-Andromeda software) about the scoring systems and platform fundamentals ensured a productive session. 

David Creasy (Matrix Science) described the history, theory, and practice of the Mascot search engine and tools. David pointed out some of the parameters in Mascot that may cause problems if not properly employed. For example, doing a non-enzyme search in Mascot is not a good idea unless a very high level of non-specific peptides is expected in the sample. Semi-trypsin is almost always a better choice if the peptides came from a tryptic digest. David also explained that one of the most promising future fields is the inclusion of spectral library search in current proteomic workflows, as is already available through SpectraST or X!Hunter.

The ensuing coffee break was particularly animated by discussions about Mascot. Some of the unanswered questions were: Why is Mascot successful and extensively used despite the existence of freely available tools such as X!Tandem, OMSSA and Andromeda? How can the Mascot scoring system be powerful yet simple at the same time? Why don't popular search engines consider the intensity of the signals in their scoring systems? The organizers decided to extend the coffee break by 10 minutes just to boost this dynamic and enthusiastic discussion.

Jürgen Cox (Max-Planck Institute for Biochemistry, Munich, Germany) introduced the MaxQuant platform for high-resolution mass spectrometry experiments. Recent revolutionary advances in high-accuracy mass spectrometry-based proteomics are providing a new basis for data-driven systems biology. Jürgen described the algorithms and complete workflows that cover the mass spectrometry data analysis: from intelligent data-driven acquisition, via algorithms for identification and quantification of proteins, to the statistical analysis of the final expression data for proteins and posttranslational modifications in the context of other omics and pathway data.

Before lunchtime, Henning Hermjakob described the current status of the proteomics repository services at the European Bioinformatics Institute. The PRoteomics IDEntifications (PRIDE) database started in 2005 and, as of its last update, contains 11,629,064 identifications and 338,501,793 spectra, supporting the most common spectrum and identification file formats.

The last day was entirely dedicated to molecular drug design and chemoinformatics. The opening lecture, entitled “Rational design of peptide inhibitors against Dengue virus”, was given by Glay Chinea (CIGB). It started with an overview of the Dengue virus, its prevalence and its typical clinical outcomes.

Violeta Perez-Nueno from the Orpailleur Team (INRIA Nancy) presented several approaches that can be used to model molecular interactions and, in more depth, a new 3D shape-based approach for predicting and quantifying drug promiscuity by correlating Gaussian clusters of ligand spherical harmonic shapes.

The presentation entitled “Epitope-based vaccines - From high-throughput data to individualized therapies” by Oliver Kohlbacher triggered an enthusiastic exchange of ideas. Epitope-based vaccines (EVs) have recently been attracting growing interest. The success of an EV is determined by the choice of epitopes used as a basis.

After lunch, a lecture entitled “Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions” was given by Marco Punta. Sequence alignment programs may miss or mis-identify homologous relationships between proteins because of different factors, including homologous overextension and convergent evolution (as observed in compositionally biased amino acid regions). He presented a study in which the Pfam collection of manually curated profile hidden Markov models is used to test the accuracy with which the alignment program HMMER3 assigns protein sequences to homologous families.

During the workshop we visited different historic and tourist places in Havana and Pinar del Rio. Some pictures:

Pinar del Rio
Nice introduction to how Cuban farmers make cigars (Tobacco House)

Bodeguita del Medio

Interesting discussion (not about search engine performance) about old Havana architecture.
An old American car... our common taxi
Occidental Miramar Venue after dinner



Pinar del Rio

Monday, 3 December 2012

#HavanaBioinfo2012 Workshop

As part of the Heberprot-P Havana 2012 International Congress, the CIGB is organizing a pre-congress Workshop on Bioinformatics and Biotechnology Applications. These tools are widely used in studies related to the EGF-EGFr system and will be highly relevant to the future development of new therapeutics. The course will cover topics such as: computational proteomics and genomics, data integration, expression data analysis and regulatory networks, protein-protein interaction networks, mathematical models of biochemical pathways, drug design, virtual screening, docking and QSAR. Bioinformatics and OMICS, Havana 2012 will be held from December 8th to 11th 2012 at the Occidental Miramar Hotel. (Workshop Site)

The workshop will include a number of advanced topics in computational proteomics, system biology resources, molecular modeling and drug design. It has been designed as a place to exchange new ideas. Some of the lectures will provide new skills and knowledge about software platforms for protein identification, docking, molecular dynamics and biological databases. Sixteen professors and invited speakers representing leading bioinformatics companies (Matrix Science, Bioinformatics Solutions) and academic groups (from Ghent University, Nancy University and Tubingen) will attend. Therefore, this workshop will provide us not only with essential knowledge about bioinformatics and its biotechnology applications, but also with a great opportunity to share experiences about the current and future challenges in bioinformatics.



Scientific Program:

Saturday 8: Bioinformatics and System Biology (Chairpersons: Henning Hermjakob, Yasset Perez-Riverol)

08:00 – 08:10: Welcome to I Bioinformatics Workshop in Biotechnology App (Yasset Perez Riverol - CIGB)
08:15 – 08:45: Opening Lecture: Cuban Biotechnology (Gerardo Guillen - CIGB)

08:55 – 09:25: Network Biology at CIGB. (Ricardo Bringas - CIGB)

09:35 – 10:05: Protein Sequence Database Evolution – UniProt. (Henning Hermjakob - EBI)

10:05 – 10:25: Coffee Break

10:25 – 10:55: The Pfam protein families database. (Marco Punta - Sanger)

11:05 – 11:35: EBI Resources from Molecular Interactions to Systems Biology Models. (Henning Hermjakob – EBI)

11:45 – 12:15: PEAKS – homology matching-assisted shotgun protein sequencing. (Baozhen Shan – Peaks, Canada)

12:25 – 01:00: PEAKS – integration of database search and de novo sequencing (Dr. Baozhen Shan – Peaks, Canada)

01:00 – 02:30: Lunch Time     

02:30 – 03:00: PepExplorer: organizing and filtering of de novo results (Felipe Leprevost - Brazil)

03:10 – 03:40: The CompOmics toolsuite, helping you take a few steps in the right direction. (Lennart Martens - Belgium)

03:50 – 04:20: Tapping into collective knowledge - mining the public proteome. (Lennart Martens - Belgium) 

04:20 – 04:40: Coffee Break

04:50 – 05:20: Meta analysis and data integration in biomarker discovery (Klemens Vierlinger)

05:30 – 06:30: Posters

Sunday 9: Bioinformatics and Proteomics (Chairpersons: Prof. David Creasy, Dr. Luis Javier González)

08:00 – 08:30:  Opening Lecture: Proteomics at CIGB (Vladimir Besada Perez- CIGB)

08:40 – 09:30: From raw data to an accurate, concise list of proteins using the Mascot tools. (David Creasy – Matrix Science)

09:35 – 10:05: Quantitation: workflows, tools, tips and causes of inaccurate results. (David Creasy – Matrix Science)

10:05 – 10:30: Coffee Break

10:45 – 11:30: OpenMS -- A Tutorial. (Oliver – Germany)   

11:40 – 12:05: Workflows in Proteomics and Metabolomics. (Oliver – Germany)

12:10 – 12:45: Introduction to the MaxQuant platform for mass spectrometry-based computational proteomics. (Jürgen Cox – Max Planck)

12:50 – 01:20: Computational strategies and software solutions for the downstream bioinformatics analysis of proteomics data. (Jürgen Cox – Max Planck)

01:25 – 01:50: PRIDE and ProteomeXchange - Co-ordinating proteomics data dissemination. (Henning Hermjakob - EBI)

02:00 – 03:30: Lunch Time

03:30 – 05:30: Other activities


Monday 10: Visit to Research Institution (CIGB)

Tuesday 11: Bioinformatics in Drug Design (Chairpersons: Prof. Oliver Kohlbacher, Osmany Guirola)

08:00 – 08:30: Opening Lecture: Rational design of peptide inhibitors of dengue virus. (Glay Chinea - CIGB)

08:40 – 09:30: Epitope-Based Vaccines - From high-throughput data to   individualized therapies. (Oliver - Germany)

09:40 – 10:10: An Overview of Modeling Molecular Interactions at INRIA Nancy (Violeta Nueno – Harmonic Pharma Senior Scientist, former researcher in the INRIA Orpailleur team)

10:10 – 10:30: Coffee Break 

10:45 – 11:15: Preclinical Compound Profiling and Drug Repositioning (Violeta Nueno – Harmonic Pharma Senior Scientist, former researcher in the INRIA Orpailleur team)

11:30 – 12:00: Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. (Marco Punta - EBI)

12:10 – 12:45: In silico design of IL-2 mutants for the therapy of cancer. (Kalet Leon – CIM)

01:00 – 02:30: Lunch Time     

02:30 – 03:00: Molecular bases of Plasmodium falciparum plasmepsins selective inhibition. (Pedro Valiente – UH)

03:10 – 03:40: Cardiovascular Genomics in the Clinic (Michal Blazejczyk)   

04:00 – 04:20: Coffee Break 
         
04:00 – 05:30: General and Posters Discussion      

You can follow all the workshop activities on LinkedIn or Twitter.


Saturday, 25 August 2012

Why R for Mass Spectrometrists and Computational Proteomics

Why R:

Nowadays it is common practice to include the statistical analysis of your results, and in silico predictions based on them, in your manuscripts and your daily research. Mass spectrometrists, biologists and bioinformaticians commonly use programs like Excel, Calc or other office tools to generate their charts and statistical analyses. In recent years many computational biologists, especially those from the genomics field, have come to regard R and Bioconductor as fundamental tools for their research.

R is a modern, functional programming language that allows for rapid development of ideas; it is a language and environment for statistical computing and graphics. The rich set of built-in functions makes it ideal for high-volume analysis or statistical studies.

Installing R on Windows or Linux:

Windows: You can download the latest version from http://www.r-project.org/. You need to select a mirror; then, on the base page, you can select the latest release of R. The next steps are as straightforward as for any Windows application.

Linux: You can download the latest precompiled release from the same page (http://www.r-project.org/) for SUSE, Debian, Ubuntu or Red Hat, or the source files as R-XXX.tar.gz.

Here you can find some tips if you have problems installing R: http://cran.r-project.org/doc/manuals/R-admin.html


First MS Example in Three lines:

"I want to know the mass distribution of my identified peptides"

First create a peptide-histogram.txt file with the list of masses, one per line, as follows:

1392.6207
1576.7609
1809.956
1653.8549
1929.0003 
Then:
> peptides.txt <- read.table("peptide-histogram.txt", header=FALSE)
> peptides <- as.vector(peptides.txt$V1)
> hist(peptides, breaks=400)

* If you want to compute the mean of the masses, it is simple (the value below comes from the complete list of masses, not only from the five shown above):
> mean(peptides)
[1] 1791.695


The hist() function can be customized with different options (remember that you can always see the help for each function using ?, for example: ?hist):

http://msenux.redwoods.edu/math/R/hist.php
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/hist.html
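As a small illustration of those options, here is a minimal sketch in base R. The masses are simulated (hypothetical data standing in for peptide-histogram.txt), and the number of breaks, labels and colours are only example choices, not recommendations.

```r
# Simulated peptide masses (hypothetical data, standing in for peptide-histogram.txt)
set.seed(1)
peptides <- rnorm(1000, mean = 1800, sd = 400)

# plot = FALSE returns the binning without drawing, handy for inspecting the counts
h <- hist(peptides, breaks = 50, plot = FALSE)

# The same histogram drawn with a title, axis labels and a fill colour
hist(peptides, breaks = 50,
     main = "Mass distribution of identified peptides",
     xlab = "Peptide mass (Da)", ylab = "Number of peptides",
     col = "grey80", border = "white")
```

The object returned by hist() keeps the bin limits in h$breaks and the number of peptides per bin in h$counts, so the numbers behind the chart can be reused in a table.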

One of the key advantages of working with R is the amount of data that you can analyze; some desktop tools have row limits (for example, MS Excel before version 2007 is limited to 65,536 rows and Excel 2007 to 1,048,576). Other reasons to consider R: (1) commercial software such as SPSS is expensive and not always up to date; (2) public web services limit the data volume; (3) self-written software is often not an option, since "mass spectrometrists are not IT people".

Generating the Venn Diagrams for Search Engines (Mascot, X!Tandem, Sequest)

"I want a Venn diagram with the shared proteins identified with Sequest, X!Tandem and Mascot"

Each of the files mascot.txt, xtandem.txt and sequest.txt contains a list of protein IDs.
* You can use the UniProt (www.uniprot.org) mapping service or PICR (http://www.ebi.ac.uk/Tools/picr/) to convert different protein IDs to a unique representation.

> library(gplots)
> mascot.txt <- read.table("mascot.txt", header=FALSE)
> xtandem.txt <- read.table("xtandem.txt", header=FALSE)
> sequest.txt <- read.table("sequest.txt", header=FALSE)
> sequest <- as.vector(sequest.txt$V1)
> mascot <- as.vector(mascot.txt$V1)
> xtandem <- as.vector(xtandem.txt$V1)
> input <- list(Mascot=mascot, XTandem=xtandem, sequest=sequest)
> venn(input)


The venn() function is part of the gplots library, and Venn diagrams are really useful to show all possible logical relations between a finite collection of sets.
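The numbers shown in each region of the diagram can also be obtained with the set functions of base R. The sketch below uses made-up accessions (hypothetical values, not real identifications) instead of the three .txt files.

```r
# Hypothetical protein accessions, standing in for the three .txt files
mascot  <- c("P1", "P2", "P3", "P4")
xtandem <- c("P2", "P3", "P5")
sequest <- c("P3", "P4", "P5", "P6")

# Proteins reported by all three search engines
shared_all <- Reduce(intersect, list(mascot, xtandem, sequest))

# Proteins reported only by Mascot
only_mascot <- setdiff(mascot, union(xtandem, sequest))

# Total number of distinct proteins across the three engines
n_total <- length(Reduce(union, list(mascot, xtandem, sequest)))
```

Having the actual lists (and not only the counts) is useful when you want to export, for example, the proteins found by a single engine.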


When I first read "Five statistical things I wished I had been taught 20 years ago" (Ewan Birney), the first thing I thought was: which R packages would be useful for mass spectrometrists and biologists?
  • The ggplot2 package for data visualization provides a set of functions to represent your data, such as scatterplots. (Basic Introduction to ggplot2)
  • The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. It is a complete package for regression and classification techniques. (caret)
  • FactoMineR is an R package dedicated to multivariate Exploratory Data Analysis. It performs classical methods such as Principal Components Analysis (PCA), Correspondence Analysis (CA) and Multiple Correspondence Analysis (MCA), as well as more advanced methods. A GUI is available. (FactoMineR)
  • The mzR package provides a unified API to the common file formats and parsers available for mass spectrometry data. It comes with a wrapper for the ISB random access parser for mass spectrometry mzXML, mzData and mzML files. (mzR)
  • Bioconductor provides tools for the analysis and comprehension of high-throughput biology data. Bioconductor has two releases each year, 554 software packages, and an active user community. (bioconductor)
  • The msProcess package provides tools for protein mass spectra processing including data preparation, denoising, noise estimation, baseline correction, intensity normalization, peak detection, peak alignment, peak quantification, and various functionalities for data ingestion/conversion, mass calibration, data quality assessment, and protein mass spectra simulation. (msProcess)
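To give a flavour of the exploratory analysis mentioned in the list, here is a minimal PCA sketch. Note that it uses prcomp() from base R rather than FactoMineR, so that it runs without installing anything, and that the intensity matrix is simulated (hypothetical data with an artificial group effect).

```r
set.seed(42)
# 6 samples x 20 proteins of simulated log-intensities (hypothetical data)
intensities <- matrix(rnorm(6 * 20, mean = 20, sd = 1), nrow = 6,
                      dimnames = list(paste0("sample", 1:6),
                                      paste0("protein", 1:20)))
# Shift the first five proteins in the last three samples to mimic a treatment effect
intensities[4:6, 1:5] <- intensities[4:6, 1:5] + 3

# PCA on centred and scaled data
pca <- prcomp(intensities, center = TRUE, scale. = TRUE)

# Proportion of variance explained by the first two components
summary(pca)$importance[2, 1:2]

# Samples in the PC1/PC2 plane, coloured by (simulated) group
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2", pch = 19,
     col = rep(c("black", "red"), each = 3))
```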

Learning R is an ongoing process, and once researchers have mastered the basics, they should be encouraged to explore the wealth of contributed packages on the Comprehensive R Archive Network (CRAN) (http://cran.r-project.org) and Bioconductor (http://www.bioconductor.org). If we start to use R in our labs, we can provide our scripts to the community along with our manuscripts and papers, which means that others can check the statistical analysis and the results. R is the leading tool for statistics, data analysis, and machine learning in the research community. It is time to begin!

Some Ref's:
  1. Statistics Using R with Biological Examples (http://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf)
  2. Biological Data Analysis Using R (http://dyerlab.bio.vcu.edu/downloads/Dyer_Data_Analysis_Using_R.pdf) 
  3. R-bloggers (http://www.r-bloggers.com/)

Thursday, 9 August 2012

Computational Methods of AP/MS Protein Interaction Data

I want to share with you an excellent presentation from Professor Alexey Nesvizhskii (Dept. of Pathology, University of Michigan). Even though this presentation is from 2009, some of its concepts, like protein inference and label-free quantification, are now generating a significant number of new algorithms and tools. It is also an excellent starting point for biologists and developers interested in computational proteomics algorithms and methods.
  


Monday, 21 May 2012

An "in-house" Tool

One of the small hidden details in publications, even in those with a high impact, is the use of "in-house programs". What is an "in-house" program or tool? Normally it is a piece of software that researchers use to analyze, process or visualize the experimental data but, most importantly, the software itself is not published.

The term by itself is inoffensive, but the concept can be extremely dangerous. We can cite hundreds of manuscripts that include "in-house" tools in the data analysis, but never "in-house instruments". The authors always need to cite the manufacturer, the reagents, even the year and the company. I know, we have a section to describe data processing, but there we mostly cite some parameters and the well-known software such as search engines (Mascot, X!Tandem, Sequest, etc.). At some point in this section, however, you can often find the term "in-house" tool. It could be a reference to an Excel formula or to a complete and complex Java program with many tasks, like parsing a search engine output, computing the FDR, removing false-positive identifications, computing peptide-spectrum-match redundancy, etc. There is no real, objective measure to distinguish between a small, simple tool and a complex one.

What does it mean:
  • It is difficult to follow the results when the researchers used complex in-house programs.
  • It is impossible to evaluate the results if you don't know the methods and algorithms inside the in-house tools.
  • The most important thing is: the results are not reproducible and not comparable!

Some journals force authors to attach the code and the programs so that readers of the article can use them. But it is still a problem in the community, and it is growing, because more methods and algorithms are publicly available and more non-bioinformatician researchers have programming skills.

Some hidden side effects are:
  • Some existing tools can handle most of these analyses, but they are not used at all, even when they are published in important, high-impact journals.
  • "Small" but very important problems in the community do not have standardized and well-tested tools to solve them.
  • Software and bioinformatics solutions are underestimated.

When a tool is designed, tested and stressed during the publication process, all of the errors, incompatibilities and statistical details are fixed in order to report the results. In the process, several datasets can be used to compare the results obtained with different settings, etc. This is the nature of a tool/algorithm publication.

Reviewers and editors should force authors to justify the use of "in-house" programs. Also, if an in-house program is needed, the code of the program must be attached, together with a user manual and a short document explaining the algorithms used by the tool.

Several journals can be used to publish bioinformatics tools as a research or technical note (not only big tools):
  • Source Code for Biology and Medicine (http://www.scfbm.org/about)
  • Bioinformatics (http://bioinformatics.oxfordjournals.org/)
  • BMC Bioinformatics (http://www.biomedcentral.com/bmcbioinformatics/)

What do you think?

Wednesday, 25 April 2012

Perl Proteomics & InSilicoSpectro

In contrast with genomics, bioinformaticians in proteomics don’t have a "big" and "complete" Perl library for proteomics data analysis. This could be related to the "heterogeneity" of proteomics: a lot of different instruments, protocols and properties. Genomics also has a huge community of bioinformaticians and standardized tools (instruments and software). In 2006 Colinge and Masselot published an open-source Perl library named InSilicoSpectro. The aim was to provide a set of recurrent functions that are necessary for proteomics data analysis.

Some illustrative functions are: mz list file format conversions, protein sequence digestion, theoretical peptide and fragment mass computations, graphical display, matching with experimental data, isoelectric point estimation (with different methods), and peptide retention time prediction.


At the end of the manuscript abstract the authors say: "We believe that InSilicoSpectro will be of great help to bioinformaticians, without detailed knowledge of proteomics specifics, and to mass spectrometrists with computer programming interest as well"

But what can we do with InSilicoSpectro and BioPerl?

Reading a FASTA file and performing a tryptic digestion:

----------------

#!/usr/bin/perl

use Bio::SeqIO;
use Bio::Seq;
use IO::String;

use InSilicoSpectro;
use InSilicoSpectro::InSilico::MassCalculator;
use InSilicoSpectro::InSilico::CleavEnzyme;
use InSilicoSpectro::InSilico::AASequence;
use InSilicoSpectro::InSilico::Peptide;
use InSilicoSpectro::InSilico::IsoelPoint;
use InSilicoSpectro::InSilico::ExpCalibrator;
undef $InSilicoSpectro::InSilico::MassCalculator::invalidElementCall;

# Arguments: FASTA file, enzyme name, missed cleavages, output file
$inFile = Bio::SeqIO->new(-file => "$ARGV[0]", -format => 'fasta');

my $enzyme     = $ARGV[1];   # Enzyme
my $misscleave = $ARGV[2];   # Number of missed cleavage sites
my $name_out   = $ARGV[3];   # Output file

InSilicoSpectro::init("insilicodef.xml");   # File of InSilicoSpectro definitions

open (OUTDATA, ">$name_out") or die("Error: cannot open file $name_out\n");

while (my $Protein = $inFile->next_seq()){

  $id = $Protein->display_id();             # Protein identifier
  $seq = $Protein->seq();                   # Sequence as a string
  $description = $Protein->description();   # Description of the sequence

  my $protein = new InSilicoSpectro::InSilico::Peptide(sequence=>"$seq",modif=>"");
  my $proteinSequence = new InSilicoSpectro::InSilico::AASequence(sequence=>$seq, AC=>$id);
  $mass_value = $protein->getMass();
  $protein_mass_value = sprintf("%.6f", $mass_value);   # Protein mass

  # Digest the protein with the selected enzyme and number of missed cleavages
  @result = digestByRegExp(protein=>$proteinSequence, minMass=>"0", nmc=>$misscleave,
                    enzyme=>InSilicoSpectro::InSilico::CleavEnzyme::getFromDico("$enzyme"));

  foreach $p (@result){
    $i++;                                   # Peptide counter
    $peptide = $p->sequence;
    print (OUTDATA ">$id\t$description\t$protein_mass_value\n");
    print (OUTDATA "$peptide\n");
  }
}

close OUTDATA or die("Error: cannot close file $name_out\n");   # Wait for the output file to close


---------------------

With a simple script the user can obtain very good results with little effort. One of the key features is the use of this library for protein database analysis. Proteomic identifications are mainly based on database searches. When you are writing your manuscript, it is common practice to include some "estimations" or predictions about the possible behavior of the experiments. Most of the time you need knowledge of the database and its statistical background, even before designing the experiment...
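For readers who prefer R (see the earlier post), the digestion rule itself can be sketched in a few lines. This is not how InSilicoSpectro implements it; it is only a minimal illustration that assumes the usual trypsin rule (cleave after K or R, but not before P), and the function name digest_trypsin and the example sequence are made up for the example.

```r
# Minimal in silico tryptic digestion (illustrative sketch, not InSilicoSpectro code)
digest_trypsin <- function(sequence, missed = 0) {
  # Mark every cleavage site: after K or R, unless the next residue is P
  marked <- gsub("([KR])(?!P)", "\\1 ", sequence, perl = TRUE)
  pieces <- strsplit(marked, " ", fixed = TRUE)[[1]]

  # Join 1, 2, ... (missed + 1) consecutive pieces to model missed cleavages
  peptides <- character(0)
  for (n in 0:missed) {
    if (length(pieces) <= n) break
    for (i in seq_len(length(pieces) - n)) {
      peptides <- c(peptides, paste(pieces[i:(i + n)], collapse = ""))
    }
  }
  peptides
}

# Hypothetical sequence: the R before P is not cleaved
digest_trypsin("MKWVTFRPSLLK", missed = 1)
```

With missed = 0 only the fully cleaved peptides are returned; increasing it adds the longer peptides that span one or more uncleaved sites.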

An example:

Mass distribution of unique peptides for different peptide tolerance errors in ppm:








The process can be divided into three steps, and each of them can be computed using InSilicoSpectro. We can reuse the first script to digest the protein sequences and add some code to filter by peptide mass and print each peptide annotated with its mass:

  foreach $p (@result){
    $i++;
    $peptide = $p->sequence;
    my $peptideInsilico = new InSilicoSpectro::InSilico::Peptide(sequence=>"$peptide",modif=>"");
    $mass_value = $peptideInsilico->getMass();
    $peptide_mass = sprintf("%.6f", $mass_value);   # Peptide mass
    # Keep only peptides inside the mass window given by the 5th and 6th arguments
    if (($peptide_mass >= $ARGV[4]) && ($peptide_mass <= $ARGV[5])){
      print (OUTDATA "$peptide\t$peptide_mass\n");
    }
  }

This small change computes the mass of each peptide and keeps only those inside the mass window given as the fifth and sixth arguments, for example 800-3500 (QTOF resolution). After this small change, an R script or a common Perl algorithm can help to build the histogram of peptide masses.
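The same mass filter can be sketched in R for a quick check of the numbers. The sketch assumes unmodified peptides and uses standard monoisotopic residue masses; the peptide list is made up for the example, and the values may differ slightly from those of InSilicoSpectro depending on its configuration.

```r
# Standard monoisotopic residue masses (Da), unmodified residues only
residue_mass <- c(G = 57.02146, A = 71.03711, S = 87.03203, P = 97.05276,
                  V = 99.06841, T = 101.04768, C = 103.00919, L = 113.08406,
                  I = 113.08406, N = 114.04293, D = 115.02694, Q = 128.05858,
                  K = 128.09496, E = 129.04259, M = 131.04049, H = 137.05891,
                  F = 147.06841, R = 156.10111, Y = 163.06333, W = 186.07931)
water <- 18.01056

# Peptide mass = sum of residue masses + one water molecule
peptide_mass <- function(peptide) {
  residues <- strsplit(peptide, "")[[1]]
  sum(residue_mass[residues]) + water
}

# Hypothetical peptides; keep those in the 800-3500 window used in the text
peptides <- c("MK", "WVTFRPSLLK", "LVNEVTEFAK")
masses   <- vapply(peptides, peptide_mass, numeric(1))
selected <- masses[masses >= 800 & masses <= 3500]
selected
```

The resulting vector can be passed directly to hist() to draw the mass distribution.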

Another key feature of InSilicoSpectro is the estimation of different peptide/protein properties such as isoelectric point and retention time. It also provides functions to extract and process peptide/spectrum matches from Mascot .dat files.

The user can easily predict the isoelectric point with the algorithm developed by David Tabb and different sets of pK values:


                          Amino acid
             NH2     COOH    C       D       E       H      K       R       Y
EMBOSS       8.6     3.6     8.5     3.9     4.1     6.5    10.8    12.5    10.1
DTASelect    8.0     3.1     8.5     4.4     4.4     6.5    10.0    12.0    10.0
Solomon      9.6     2.4     8.3     3.9     4.3     6.0    10.5    12.5    10.1
Sillero      8.2     3.2     9.0     4.0     4.5     6.4    10.4    12.0    10.0
Rodwell      8.0     3.1     8.33    3.68    4.25    6.0    11.5    11.5    10.07
Patrickios   11.2    4.2     -       4.2     4.2     -      11.2    11.2    -
Lehninger    9.69    2.34    8.33    3.86    4.25    6.0    10.5    12.4    10.0

Retention time can be computed with the Krokhin and Petritis methods using simple functions.

An example of isoelectric point calculation is:


 $pi = InSilicoSpectro::InSilico::IsoelPoint->new(method=>"iterative",current=>"Lehninger",%settings);
 $piPeptide = $pi->predict(peptide => uc $peptide);
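To make the idea of an iterative pI calculation concrete, here is a minimal sketch in R. It is not the InSilicoSpectro implementation: it assumes the Henderson-Hasselbalch equation for each ionisable group, the Lehninger pK values from the table above, and a plain bisection search for the pH at which the net charge is zero. The function names are made up for the example.

```r
# Lehninger pK values, taken from the table above
pk_positive <- c(Nterm = 9.69, H = 6.0, K = 10.5, R = 12.4)
pk_negative <- c(Cterm = 2.34, C = 8.33, D = 3.86, E = 4.25, Y = 10.0)

# Net charge of a peptide at a given pH (Henderson-Hasselbalch)
net_charge <- function(peptide, pH) {
  residues <- strsplit(toupper(peptide), "")[[1]]
  # One N-terminus and one C-terminus, plus the count of each ionisable residue
  n_pos <- c(Nterm = 1, sapply(c("H", "K", "R"), function(a) sum(residues == a)))
  n_neg <- c(Cterm = 1, sapply(c("C", "D", "E", "Y"), function(a) sum(residues == a)))
  sum(n_pos / (1 + 10^(pH - pk_positive))) -
    sum(n_neg / (1 + 10^(pk_negative - pH)))
}

# Bisection between pH 0 and 14 until the interval is narrower than tol
isoelectric_point <- function(peptide, tol = 0.001) {
  low <- 0
  high <- 14
  while (high - low > tol) {
    mid <- (low + high) / 2
    # Net charge decreases with pH, so a positive charge means the pI is higher
    if (net_charge(peptide, mid) > 0) low <- mid else high <- mid
  }
  (low + high) / 2
}

isoelectric_point("LVNEVTEFAK")
```

Swapping in another row of the table only requires changing the two pK vectors, which is a quick way to see how much the predicted pI depends on the chosen pK set.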

An important module of the library is related to spectrum processing. Even though this kind of processing is considered "more specialized", some researchers with less of a bioinformatics background use these functions to evaluate spectrum quality...

InSilicoSpectro gives you an excellent opportunity to process your database and your identification data...

Take a look!!!

http://search.cpan.org/~alexmass/InSilicoSpectro-1.3.24/
http://pubs.acs.org/doi/abs/10.1021/pr0504236
http://pubs.acs.org/doi/abs/10.1021/ac070488n
http://pubs.acs.org/doi/abs/10.1021/pr700840y
http://pubs.acs.org/doi/abs/10.1021/pr200031z