Saturday, 25 August 2012

Why R for Mass Spectrometrists and Computational Proteomics

Why R:

It is common practice to integrate statistical analysis and in silico predictions of your data into your manuscripts and your daily research. Mass spectrometrists, biologists and bioinformaticians commonly use programs such as Excel, Calc or other office tools to generate their charts and statistical analyses. In recent years, many computational biologists, especially those from the genomics field, have come to regard R and Bioconductor as fundamental tools for their research.

R is a modern, functional programming language that allows for rapid development of ideas; it is a language and environment for statistical computing and graphics. Its rich set of built-in functions makes it ideal for high-volume analysis or statistical studies.


 
Installing R on Windows or Linux:

Windows: You can download the latest version from http://www.r-project.org/. Select a mirror, then on the base page select the latest release of R. The remaining steps are straightforward, as with any other Windows application.

Linux: You can download the latest precompiled release from the same page (http://www.r-project.org/) for SUSE, Debian, Ubuntu or Red Hat, or the source files as R-XXX.tar.gz.

If you have problems installing R, you can find some tips here: http://cran.r-project.org/doc/manuals/R-admin.html


First MS Example in Three lines:

"I want to know the mass distribution of my identified peptides"

First create a peptide-histogram.txt file containing the list of masses, one per line, as follows:

1392.6207
1576.7609
1809.956
1653.8549
1929.0003 
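then
> # read the one-column file; without a header the column is named V1
> peptides.txt <- read.table("peptide-histogram.txt", header=FALSE)
> # extract the masses as a plain numeric vector
> peptides <- as.vector(peptides.txt$V1)
> # plot the mass distribution using about 400 bins
> hist(peptides, breaks=400)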

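* If you want to compute the mean of the masses, it is simple (the output below comes from a full peptide list rather than only the five masses shown above, which on their own average 1672.439):
> mean(peptides)
[1] 1791.695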


The hist() function can be customized with different options (remember that you can always see the help for any function using ?, for example ?hist):

http://msenux.redwoods.edu/math/R/hist.php
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/hist.html
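As a small illustration of these options, here is a minimal sketch that uses only the five example masses listed earlier. The bin width of 200, the axis labels (including the assumption that the masses are in Da) and the colour are choices made for this example, not values from the original analysis.

```r
# Five example masses from the peptide-histogram.txt listing
peptides <- c(1392.6207, 1576.7609, 1809.956, 1653.8549, 1929.0003)

# Customized histogram: explicit bin edges every 200 units, a title,
# axis labels and a fill colour (all illustrative choices)
h <- hist(peptides,
          breaks = seq(1200, 2000, by = 200),
          main   = "Mass distribution of identified peptides",
          xlab   = "Peptide mass (Da, assumed unit)",
          ylab   = "Number of peptides",
          col    = "grey80")

# hist() also returns the bin counts, which is handy for checking the plot
print(h$counts)
```

Passing a vector of bin edges to breaks gives exact control over the bins, whereas a single number such as breaks=400 is only treated as a suggestion by hist().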

One of the key advantages of working with R is the amount of data that you can analyze. Some desktop tools have row limits (for example, MS Excel versions before 2007 are limited to 65,536 rows and MS Excel 2007 to 1,048,576 rows). Other reasons to consider R: (1) commercial software such as SPSS is expensive and not always up to date; (2) public web services accept only a limited data volume; (3) self-written software is often not an option, because "mass spectrometrists are not IT people".
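To give a feeling for the row-limit point, the sketch below builds a vector with more values than the 1,048,576-row limit and summarizes it. The masses are simulated with runif() purely for illustration; the range of 800 to 3000 and the size of two million are arbitrary assumptions, not real peptide data.

```r
# Simulated data only: two million made-up masses, more than the
# 1,048,576-row limit mentioned above
set.seed(1)
n <- 2000000
masses <- runif(n, min = 800, max = 3000)

# R holds the whole vector in memory and summarizes it in one call
print(length(masses) > 1048576)
print(summary(masses))
```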

Generating Venn Diagrams for Search Engines (Mascot, XTandem, Sequest)

"I want a Venn diagram with the shared proteins identified with Sequest, XTandem and Mascot"

Each of the files mascot.txt, xtandem.txt and sequest.txt contains a list of protein IDs, one per line.
* You can use the UniProt mapping service (www.uniprot.org) or PICR (http://www.ebi.ac.uk/Tools/picr/) to convert different protein IDs to a unique representation.

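> library(gplots)
> # read each one-column list of protein IDs (the column is named V1)
> mascot.txt <- read.table("mascot.txt", header=FALSE)
> xtandem.txt <- read.table("xtandem.txt", header=FALSE)
> sequest.txt <- read.table("sequest.txt", header=FALSE)
> # convert each column to a plain vector of IDs
> sequest <- as.vector(sequest.txt$V1)
> mascot <- as.vector(mascot.txt$V1)
> xtandem <- as.vector(xtandem.txt$V1)
> # a named list: the names become the set labels in the diagram
> input <- list(Mascot=mascot, XTandem=xtandem, Sequest=sequest)
> venn(input)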


The venn() function is part of the gplots library. Venn diagrams are really useful for showing all possible logical relations between a finite collection of sets.
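If you also want the actual lists behind the regions of the diagram, base R set functions are enough. The following is a minimal sketch; the protein IDs (PROT_A to PROT_F) are made up for the example and stand in for the contents of mascot.txt, xtandem.txt and sequest.txt.

```r
# Hypothetical protein IDs standing in for the three search engine lists
mascot  <- c("PROT_A", "PROT_B", "PROT_C", "PROT_D")
xtandem <- c("PROT_B", "PROT_C", "PROT_E")
sequest <- c("PROT_C", "PROT_D", "PROT_E", "PROT_F")

# Proteins identified by all three search engines (centre of the diagram)
shared_all <- Reduce(intersect, list(mascot, xtandem, sequest))

# Proteins identified by Mascot only
only_mascot <- setdiff(mascot, union(xtandem, sequest))

# All distinct proteins identified by at least one search engine
all_ids <- Reduce(union, list(mascot, xtandem, sequest))

print(shared_all)
print(only_mascot)
print(length(all_ids))
```

The same pattern with setdiff() and union() gives the proteins unique to XTandem or Sequest.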


When I first read "Five statistical things I wished I had been taught 20 years ago" (Ewan Birney), my first thought was: which R packages would be useful for mass spectrometrists and biologists?
  • The ggplot2 package for data visualization provides a set of functions to represent your data, such as scatterplots. (Basic Introduction to ggplot2)
  • The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. It is a complete package for regression and classification techniques. (caret)
  • FactoMineR is an R package dedicated to multivariate exploratory data analysis. It performs classical methods such as Principal Components Analysis (PCA), Correspondence Analysis (CA) and Multiple Correspondence Analysis (MCA), as well as more advanced methods. A GUI is available. (FactoMineR)
  • The mzR package provides a unified API to the common file formats and parsers available for mass spectrometry data. It comes with a wrapper for the ISB random access parser for mass spectrometry mzXML, mzData and mzML files. (mzR)
  • Bioconductor provides tools for the analysis and comprehension of high-throughput biology data. Bioconductor has two releases each year, 554 software packages, and an active user community. (Bioconductor)
  • The msProcess package provides tools for protein mass spectra processing, including data preparation, denoising, noise estimation, baseline correction, intensity normalization, peak detection, peak alignment, peak quantification, and various functionalities for data ingestion/conversion, mass calibration, data quality assessment, and protein mass spectra simulation. (msProcess)

Learning R is an ongoing process, and once researchers have mastered the basics, they should be encouraged to explore the wealth of contributed packages on the Comprehensive R Archive Network (CRAN) (http://cran.r-project.org) and Bioconductor (http://www.bioconductor.org). If we start to use R in our labs, we can provide our scripts to the community along with our manuscripts and papers, which means that others can check the statistical analysis and the results. R is a leading tool for statistics, data analysis, and machine learning in the research community. It is time to begin!

Some References:
  1. Statistics Using R with Biological Examples (http://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf)
  2. Biological Data Analysis Using R (http://dyerlab.bio.vcu.edu/downloads/Dyer_Data_Analysis_Using_R.pdf) 
  3. R-bloggers (http://www.r-bloggers.com/)
  4. https://github.com/lgatto/RforProteomics
  5. http://bioconductor.org/packages/release/data/experiment/html/RforProteomics.html
  6. https://groups.google.com/forum/#!forum/rbioc-sig-proteomics