Saturday, 28 November 2015

Protein identification with Comet, PeptideProphet and ProteinProphet using BioDocker containers

Proteomics data analysis is dominated by database search engine strategies. Perhaps the most common protocol today is to retrieve raw data from a mass spectrometer, convert it from a binary format to a text-based format, and then process it with a database search algorithm. The resulting data need to be statistically filtered in order to converge on a final list of identified peptides and proteins.

Among search engines, Comet (the youngest son of SEQUEST) is one of the most popular nowadays. Today we are going to show how to run a simple analysis protocol using the Comet database search engine followed by statistical analysis with PeptideProphet and ProteinProphet, two of the best-known and most robust post-processing algorithms for proteomics data.

This pipeline is available in the TPP, but several users prefer to use the individual components rather than the complete Trans-Proteomics Pipeline. The big difference here is how we are going to do it: instead of going step by step through how to install and configure Comet and the TPP, we are going to run the pipeline using Docker containers from the BioDocker project (you can get more information on the project here).
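To give a concrete feel for what this looks like, here is a minimal sketch of the container-based run. The Docker image names and tool arguments below are placeholders based on typical Comet and TPP usage rather than the exact BioDocker tags, so check the BioDocker registry for the published images:

# pull the (placeholder) BioDocker images for Comet and the TPP tools
docker pull biodckr/comet
docker pull biodckr/tpp

# run the Comet search, mounting the current directory (spectra, FASTA database and comet.params) into the container
docker run --rm -v "$PWD":/data -w /data biodckr/comet comet -Pcomet.params sample.mzML

# validate the peptide-spectrum matches with PeptideProphet (via xinteract), then infer proteins with ProteinProphet
docker run --rm -v "$PWD":/data -w /data biodckr/tpp xinteract -Ninteract.pep.xml sample.pep.xml
docker run --rm -v "$PWD":/data -w /data biodckr/tpp ProteinProphet interact.pep.xml interact.prot.xml

The only Docker-specific part is the -v "$PWD":/data mount, which exposes your working directory inside the container; everything after the image name is the same command line you would use with a local installation.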

Wednesday, 21 October 2015

Installing MESOS in your Mac

1- Homebrew is an open source package management system for the Mac that simplifies installation of packages from source.

ruby -e "$(curl -fsSL"

2- Once you have Homebrew installed, you can install Mesos on your laptop with these two commands:

brew update
brew install mesos

You will need to wait while the most current, stable version of Mesos is downloaded, compiled, and installed on your machine. Homebrew will let you know when it’s finished by displaying a beer emoji in your terminal and a message like the following:

/usr/local/Cellar/mesos/0.19.0: 83 files, 24M, built in 17.4 minutes
Start Your Mesos Cluster

3- Running Mesos on your machine: Now that you have Mesos installed on your laptop, it’s easy to start your Mesos cluster. To see Mesos in action, spin up an in-memory master with the following command:

/usr/local/sbin/mesos-master --registry=in_memory --ip=127.0.0.1

A Mesos cluster needs at least one Mesos Master to coordinate and dispatch tasks onto Mesos Slaves. When experimenting on your laptop, a single master is all you need. Once your Mesos Master has started, you can visit its management console: http://localhost:5050

Since a Mesos Master needs slaves onto which it will dispatch jobs, you might also want to run some of those. Mesos Slaves can be started by running the following command for each slave you wish to launch:

sudo /usr/local/sbin/mesos-slave --master=127.0.0.1:5050
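Once the master and at least one slave are running, a quick sanity check (assuming the default master port, 5050) is to query the master's HTTP endpoint; the registered slave should appear in the returned JSON as well as in the web console mentioned above:

# ask the master for its current state; registered slaves are listed under "slaves"
curl http://localhost:5050/master/state.json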

Tuesday, 6 October 2015

The end of the big files nightmare in Github

One of the nightmares of GitHub has always been big files. The previous 100 MB limit made it difficult to test applications with real examples, demanding a lot of work to create "dummy" test files. Today came the official announcement from GitHub:

Git LFS is an open source Git extension that we released in April for integrating large binary files into your Git workflow. Distributed version control systems like Git have enabled new and powerful workflows, but they haven’t always been practical for versioning large files.
Git LFS solves this problem by replacing large files with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise. The new release includes:
  • New git lfs fetch and git lfs pull commands that download objects much faster than the standard Git smudge filter
  • Options for customizing what files are automatically downloaded on checkout
  • Selectively ignore a directory of large files that you don’t need for daily work
  • Download recent files from other branches
  • Improvements to git lfs push that filter the number of commits to scan for eligible LFS objects to upload. This greatly reduces the time to push new feature branches
  • A Windows installer and Linux packages for more convenient installation
  • An experimental extension system for teams that want to customize how objects are stored on the server
Git LFS is now available to all users on GitHub.com; just install the client to get started.
I just added my first big file with these steps:

1 - Download the Git LFS client from here, or install it with Homebrew:
   brew install git-lfs
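If Homebrew or the installer does not do it for you, the client also needs to register its smudge/clean filters in your Git configuration once per machine; with recent Git LFS versions that is a single command:

# set up the Git LFS filters in ~/.gitconfig (run once per machine)
git lfs install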

2- Select the file types you'd like Git LFS to manage (or directly edit your .gitattributes). You can configure additional file extensions at any time:

git lfs track "*.psd"
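Behind the scenes, the track command simply writes a tracking rule into .gitattributes, which should be committed along with your files; you can inspect it with:

# prints something like: *.psd filter=lfs diff=lfs merge=lfs -text
cat .gitattributes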
3 - There is no step three. Just commit and push to GitHub as you normally would:

git add file.psd
git commit -m "Add design file"
git push origin master
Done !!!!

Wednesday, 30 September 2015

First Scrum Board

Here is my first Scrum board, used to guide the release of the OmicsDI project.

Team members update the task board continuously throughout each sprint; if someone thinks of a new task (“test a new machine learning algorithm”), she writes a new card and puts it on the wall. Either during or before the daily scrum, estimates are changed (up or down) and cards are moved around the board.

Each row on the Scrum board is a user story, which is the unit of work we encourage teams to use for their product backlog.

During the sprint planning meeting, the team selects the product backlog items they can complete during the next sprint. Each product backlog item is turned into multiple sprint backlog items, and each of these is represented by one task card placed on the Scrum board.

  • Story (User Story): The story description (“As a user we want to…”) shown on that row.
  • Ongoing:  Any card being worked on goes here. The programmer who chooses to work on it moves it over when she's ready to start the task. Often, this happens during the daily scrum when someone says, “I'm going to work on the boojum today.”
  • Testing: A lot of tasks have corresponding test task cards. So, if there's a “Code the boojum class” card, there are likely one or more task cards related to testing: “Test the boojum”, “Write FitNesse tests for the boojum”, “Write FitNesse fixture for the boojum”, and so on.
  • Done: Cards pile up over here when they're done. They're removed at the end of the sprint. Sometimes we remove some or all during a sprint if there are a lot of cards.

Optionally, depending on the team, the culture, the project and other considerations:
  • Notes: Just a place to jot a note or two.
  • Tests Specified: We like to do “Story Test-Driven Development,” or “Acceptance Test-Driven Development,” which means the tests are written before the story is coded. Many teams find that it helps to have acceptance tests identified before coding begins on a particular story. This column just contains a checkmark to indicate the tests are specified.

Friday, 11 September 2015

An API for all MS-based File formats

We recently released and published our first Java API (Application Programming Interface) for the most common file formats in proteomics, covering not only mass spectra files but also identification files such as mzIdentML and mzTab.

ms-data-core-api

The library allows end-users and developers to use a common data structure for proteomics independently of the underlying file type. But first, let's try to understand what an API is.

What is an API?

Imagine you are a builder or civil engineer building a bridge: different components, blocks and teams need to be coordinated and fitted together to produce the final result. Miscommunication between team members, mismatched block sizes or inconsistent building plans will only produce strange results.

In the simplest terms, APIs are sets of requirements, data structures and objects that govern how applications and software components talk to each other. An API is a set of routines and protocols that provide building blocks for programmers and web developers to build software applications. In the past, APIs were largely associated with computer operating systems and desktop applications; in recent years, we have seen the emergence of Web APIs (Web Services).

What is ms-data-core-api?

The ms-data-core-api is a free, open-source library for developing computational proteomics tools and pipelines. The Application Programming Interface, written in Java, enables rapid tool creation by providing a robust, pluggable programming interface and common data model. The data model is based on controlled vocabularies/ontologies and captures the whole range of data types included in common proteomics experimental workflows, going from spectra to peptide/protein identifications to quantitative results. 

The library contains readers for three of the most used Proteomics Standards Initiative standard file formats: mzML, mzIdentML, and mzTab. In addition to mzML, it also supports other common mass spectra data formats: dta, ms2, mgf, pkl, apl (text-based), mzXML and mzData (XML-based). Also, it can be used to read PRIDE XML, the original format used by the PRIDE database, one of the world-leading proteomics resources. Finally, we present a set of algorithms and tools whose implementation illustrates the simplicity of developing applications using the library.

Saturday, 29 August 2015

DIA-Umpire Pipeline Using BioDocker containers

The complexity of some bioinformatics software is well known and has been discussed in papers, blog posts, etc. This is especially true for software that depends on many components and tools, making it almost impossible for a new user to try it for the first time. @BioDocker aims to simplify the process of testing, compiling and deploying bioinformatics software. Our previous post showed how to use the TPP software from the Institute for Systems Biology.

Recently, Data Independent Acquisition (DIA) methods have been receiving a lot of attention from the proteomics community, especially SWATH. Here we are going to demonstrate the value of Docker through a complex and powerful pipeline called DIA-Umpire, showing how to download, run and obtain results from the DIA-Umpire pipeline.
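As with the Comet example above, the heavy lifting reduces to a couple of docker commands. The image name and the path to the DIA-Umpire JAR inside the container are placeholders here, not the exact BioDocker layout, so check the registry for the published image; the step shown is DIA-Umpire's signal-extraction module, which turns the DIA (SWATH) run into pseudo-MS/MS spectra that can then be searched and validated exactly as in the previous posts:

# placeholder image name; check the BioDocker registry for the published container
docker pull biodckr/dia-umpire

# signal extraction: build pseudo MS/MS spectra from the DIA run with the DIA-Umpire SE module
docker run --rm -v "$PWD":/data -w /data biodckr/dia-umpire \
    java -Xmx8G -jar /opt/dia-umpire/DIA_Umpire_SE.jar sample.mzXML diaumpire_se.params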