Sunday, 13 March 2016

Genome Mapping and SNP Calling with BioDocker

#!/bin/bash
# Based on http://www.htslib.org/workflow/#mapping_to_variant
set -xeu

FQ1=y1.fastq    # paired-end reads, mate 1
FQ2=y2.fastq    # paired-end reads, mate 2
REF=yeast.fasta # reference genome (S. cerevisiae R64-1-1)
BNM=yeastD      # base name for all output files

RUNINDOCKER=1   # set to 0 to use locally installed tools instead of Docker

SAMTOOLS=samtools
BWA=bwa
TABIX=tabix
BCFTOOLS=bcftools
PLOTVCFSTATS=plot-vcfstats

if [[ "$RUNINDOCKER" -eq "1" ]]; then
echo "RUNNING IN DOCKER"
DRUN="docker run --rm -v $PWD:/data --workdir /data -i"
#--user=biodocker

SAMTOOLS_IMAGE=biodckr/samtools
BWA_IMAGE=biodckr/bwa
TABIX_IMAGE=biodckrdev/htslib:1.2.1
BCFTOOLS_IMAGE=biodckr/bcftools


docker pull $SAMTOOLS_IMAGE
docker pull $BWA_IMAGE
docker pull $TABIX_IMAGE
docker pull $BCFTOOLS_IMAGE

SAMTOOLS="$DRUN $SAMTOOLS_IMAGE $SAMTOOLS"
BWA="$DRUN $BWA_IMAGE $BWA"
TABIX="$DRUN $TABIX_IMAGE $TABIX"
BCFTOOLS="$DRUN $BCFTOOLS_IMAGE $BCFTOOLS"
PLOTVCFSTATS="$DRUN $BCFTOOLS_IMAGE $PLOTVCFSTATS"
else
echo "RUNNING LOCAL"
fi

HEADLEN=100000  # keep only the first 100,000 FASTQ lines (25,000 reads) per file to keep the example small

if [[ ! -f "$FQ1" ]]; then
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507778/SRR507778_1.fastq.gz| gzip -d | head -$HEADLEN > $FQ1.tmp && mv $FQ1.tmp $FQ1
fi

if [[ ! -f "$FQ2" ]]; then
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507778/SRR507778_2.fastq.gz| gzip -d | head -$HEADLEN > $FQ2.tmp && mv $FQ2.tmp $FQ2
fi

if [[ ! -f "$REF" ]]; then
curl ftp://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz | gunzip -c > $REF.tmp && mv $REF.tmp $REF
fi

if [[ ! -f "$REF.fai" ]]; then
$SAMTOOLS faidx $REF
fi

if [[ ! -f "$REF.bwt" ]]; then
$BWA index $REF
fi

if [[ ! -f "$BNM.sam" ]]; then
$BWA mem -R '@RG\tID:foo\tSM:bar\tLB:library1' $REF $FQ1 $FQ2 > $BNM.sam.tmp && mv $BNM.sam.tmp $BNM.sam
fi

if [[ ! -f "$BNM.bam" ]]; then
#$SAMTOOLS sort -O bam -T /tmp -l 0 --input-fmt-option SAM -o $BNM.tmp.bam $BNM.sam && mv $BNM.tmp.bam $BNM.bam
$SAMTOOLS sort -O bam -T /tmp -l 0 -o $BNM.tmp.bam $BNM.sam && mv $BNM.tmp.bam $BNM.bam
fi

if [[ ! -f "$BNM.cram" ]]; then
$SAMTOOLS view -T $REF -C -o $BNM.tmp.cram $BNM.bam && mv $BNM.tmp.cram $BNM.cram
fi

if [[ ! -f "$BNM.P.cram" ]]; then
$BWA mem $REF $FQ1 $FQ2 | \
$SAMTOOLS sort -O bam -l 0 -T /tmp - | \
$SAMTOOLS view -T $REF -C -o $BNM.P.tmp.cram - && mv $BNM.P.tmp.cram $BNM.P.cram
fi

#if [[ ! -f "" ]]; then
#$SAMTOOLS view $BNM.cram
#fi

#if [[ ! -f "" ]]; then
#$SAMTOOLS mpileup -f $REF $BNM.cram
#fi

if [[ ! -f "$BNM.vcf.gz" ]]; then
$SAMTOOLS mpileup -ugf $REF $BNM.bam | $BCFTOOLS call -vmO z -o $BNM.vcf.gz.tmp && mv $BNM.vcf.gz.tmp $BNM.vcf.gz
fi

if [[ ! -f "$BNM.vcf.gz.tbi" ]]; then
$TABIX -p vcf $BNM.vcf.gz
fi

if [[ ! -f "$BNM.vcf.gz.stats" ]]; then
$BCFTOOLS stats -F $REF -s - $BNM.vcf.gz > $BNM.vcf.gz.stats.tmp && mv $BNM.vcf.gz.stats.tmp $BNM.vcf.gz.stats
fi

mkdir plots &>/dev/null || true

#if [[ ! -f "plots/tstv_by_sample.0.png" ]]; then
#$PLOTVCFSTATS -p plots/ $BNM.vcf.gz.stats
#fi

if [[ ! -f "$BNM.vcf.filtered.gz" ]]; then
$BCFTOOLS filter -O z -o $BNM.vcf.filtered.gz -s LOWQUAL -i'%QUAL>10' $BNM.vcf.gz
fi

Saturday, 28 November 2015

Protein identification with Comet, PeptideProphet and ProteinProphet using BioDocker containers


Proteomics data analysis is dominated by database search engine strategies. Perhaps the most common protocol today is to retrieve raw data from a mass spectrometer, convert it from a binary format to a text-based format, and then process it with a database search algorithm. The resulting data then need to be statistically filtered in order to converge on a final list of identified peptides and proteins.

Among search engines, Comet (the youngest son of SEQUEST) is one of the most popular nowadays. Today we are going to show how to run a simple analysis protocol using the Comet database search engine, followed by statistical analysis with PeptideProphet and ProteinProphet, two of the best-known and most robust processing algorithms for proteomics data.

This pipeline is available in the Trans-Proteomic Pipeline (TPP); however, several users prefer to use the individual components rather than the full TPP. The big difference here is how we are going to do it. Instead of going step by step through how to install and configure Comet and TPP, we are going to run the pipeline using Docker containers from the BioDocker project (you can get more information on the project here).
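
As a rough sketch, the container-based run can look much like the mapping script above. The image names and tags, the parameter file (comet.params), the spectra file (sample.mzXML) and the decoy prefix are assumptions for illustration only; check the BioDocker registry for the exact images and each tool's documentation for its options.

DRUN="docker run --rm -v $PWD:/data --workdir /data"

# 1. Database search with Comet against a target/decoy FASTA database
#    (binary name inside the container may differ)
$DRUN biodckr/comet comet -Pcomet.params sample.mzXML

# 2. Peptide-level statistical validation with PeptideProphet, run through xinteract
#    (-d sets the assumed decoy prefix, -N names the output pepXML)
$DRUN biodckr/tpp xinteract -dDECOY_ -Ninteract.pep.xml sample.pep.xml

# 3. Protein inference with ProteinProphet on the validated peptides
$DRUN biodckr/tpp ProteinProphet interact.pep.xml interact.prot.xml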

Wednesday, 21 October 2015

Installing Mesos on your Mac

1- Homebrew is an open source package management system for the Mac that simplifies installation of packages from source.

ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

2- Once you have Homebrew installed, you can install Mesos on your laptop with these two commands:

brew update
brew install mesos

You will need to wait while the most current, stable version of Mesos is downloaded, compiled, and installed on your machine. Homebrew will let you know when it’s finished by displaying a beer emoji in your terminal and a message like the following:

/usr/local/Cellar/mesos/0.19.0: 83 files, 24M, built in 17.4 minutes

Start Your Mesos Cluster

3- Running Mesos on your machine: Now that you have Mesos installed on your laptop, it’s easy to start your Mesos cluster. To see Mesos in action, spin up an in-memory master with the following command:

/usr/local/sbin/mesos-master --registry=in_memory --ip=127.0.0.1

A Mesos cluster needs at least one Mesos Master to coordinate and dispatch tasks onto Mesos Slaves. When experimenting on your laptop, a single master is all you need. Once your Mesos Master has started, you can visit its management console: http://localhost:5050
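
You can also check from the terminal that the master is responding. The state endpoint below is the one exposed by Mesos releases of this vintage; treat the exact path as an assumption and adjust it for your version:

curl http://127.0.0.1:5050/master/state.json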



Since a Mesos Master needs slaves onto which it will dispatch jobs, you might also want to run some of those. Mesos Slaves can be started by running the following command for each slave you wish to launch:

sudo /usr/local/sbin/mesos-slave --master=127.0.0.1:5050
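
Once at least one slave has registered, you can submit a throwaway task with the mesos-execute test framework that ships with Mesos to confirm the cluster actually accepts work (the flags shown are assumptions; check mesos-execute --help for your version):

mesos-execute --master=127.0.0.1:5050 --name="hello-mesos" --command="sleep 5"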


Tuesday, 6 October 2015

The end of the big files nightmare in Github

One of the nightmares of GitHub has always been big files. The previous 100 MB limit made it difficult to test applications with real examples, demanding a lot of work to create "dummy" test files. Today, the official announcement from GitHub:

Git LFS is an open source Git extension that we released in April for integrating large binary files into your Git workflow. Distributed version control systems like Git have enabled new and powerful workflows, but they haven’t always been practical for versioning large files.
Git LFS solves this problem by replacing large files with text pointers inside Git, while storing the file contents on a remote server like GitHub.com. 
  • New git lfs fetch and git lfs pull commands that download objects much faster than the standard Git smudge filter
  • Options for customizing what files are automatically downloaded on checkout
  • Selectively ignore a directory of large files that you don’t need for daily work
  • Download recent files from other branches
  • Improvements to git lfs push that filter the number of commits to scan for eligible LFS objects to upload. This greatly reduces the time to push new feature branches
  • A Windows installer and Linux packages for more convenient installation
  • An experimental extension system for teams that want to customize how objects are stored on the server
Git LFS is now available to all users on GitHub.com, just install the client to get started.
I just added my first big file with these steps:

1 - Download the Git LFS plugin from https://git-lfs.github.com/ or install it with Homebrew:
   brew install git-lfs

2 - Select the file types you'd like Git LFS to manage (or directly edit your .gitattributes). You can configure additional file extensions at any time.

git lfs track "*.psd"

3 - There is no step three. Just commit and push to GitHub as you normally would.

git add file.psd
git commit -m "Add design file"
git push origin master
 
Done !!!!
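
As a quick sanity check (not part of the official steps above), you can confirm that Git itself only stores a small text pointer for the tracked file:

git show HEAD:file.psd

which should print a pointer along the lines of:

version https://git-lfs.github.com/spec/v1
oid sha256:<hash of the file contents>
size <size in bytes>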

Wednesday, 30 September 2015

First Scrum Board

Here is my first Scrum board, used to guide the release of the OmicsDI project.

Team members update the task board continuously each sprint; if someone thinks of a new task (“test a new machine learning algorithm”), she writes a new card and puts it on the wall. Either during or before the daily scrum, estimates are changed (up or down), and cards are moved around the board.

Each row on the Scrum board is a user story, which is the unit of work we encourage teams to use for their product backlog.

During the sprint planning meeting, the team selects the product backlog items they can complete during the next sprint. Each product backlog item is turned into multiple sprint backlog items, and each of these is represented by one task card that is placed on the Scrum board.

  • Story (User Story): The story description (“As a user we want to…”) shown on that row.
  • Ongoing:  Any card being worked on goes here. The programmer who chooses to work on it moves it over when she's ready to start the task. Often, this happens during the daily scrum when someone says, “I'm going to work on the boojum today.”
  • Testing: A lot of tasks have corresponding test task cards. So, if there’s a “Code the boojum class” card, there are likely one or more task cards related to testing: “Test the boojum,” “Write FitNesse tests for the boojum,” and “Write FitNesse fixture for the boojum.”
  • Done: Cards pile up over here when they're done. They're removed at the end of the sprint. Sometimes we remove some or all during a sprint if there are a lot of cards.

Optionally, depending on the team, the culture, the project and other considerations:
  • Notes: Just a place to jot a note or two.
  • Tests Specified: We like to do “Story Test-Driven Development,” or “Acceptance Test-Driven Development,” which means the tests are written before the story is coded. Many teams find that it helps to have acceptance tests identified before coding begins on a particular story. This column just contains a checkmark to indicate the tests are specified.