Friday, 11 September 2015

An API for all MS-based File formats

We recently released our first Java API (Application Programming Interface) for the most common file formats in proteomics. It covers not only MS files but also identification files such as mzIdentML and mzTab.

ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api)

The library allows end users and developers to use a common data structure for proteomics data, independently of the file type. But first, let's try to understand what an API is.

What is an API?

Imagine you are a builder or civil engineer building a bridge: different components, blocks and teams need to be coordinated and fitted together to reach the final result. Miscommunication between team members, mismatched block sizes or inconsistent building plans only produce strange results.

In the simplest terms, an API is a set of requirements, data structures and objects that governs how applications and software components talk to each other. It is a set of routines and protocols that provides building blocks for programmers and web developers to build software applications. In the past, APIs were largely associated with computer operating systems and desktop applications. In recent years, though, we have seen the emergence of Web APIs (Web Services).
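To make the idea concrete, here is a minimal Java sketch of an API as a contract between components. All names in it (SpectrumSource, InMemorySource, PeakCounter) are hypothetical and are not part of ms-data-core-api; the point is only that the calling code depends on the interface, so a different reader could be plugged in without changing it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical contract: any data source that can hand out spectra by id.
interface SpectrumSource {
    Iterable<String> spectrumIds();

    double[] peaksOf(String spectrumId); // m/z values of one spectrum
}

// One possible implementation, backed by a map. A file-based reader could
// implement the same interface without any change to the calling code.
final class InMemorySource implements SpectrumSource {
    private final Map<String, double[]> spectra = new LinkedHashMap<>();

    InMemorySource add(String id, double... mzValues) {
        spectra.put(id, mzValues);
        return this;
    }

    @Override
    public Iterable<String> spectrumIds() {
        return spectra.keySet();
    }

    @Override
    public double[] peaksOf(String spectrumId) {
        return spectra.get(spectrumId);
    }
}

// Client code: it only knows the interface, never the concrete reader.
final class PeakCounter {
    static int totalPeaks(SpectrumSource source) {
        int total = 0;
        for (String id : source.spectrumIds()) {
            total += source.peaksOf(id).length;
        }
        return total;
    }

    public static void main(String[] args) {
        SpectrumSource source = new InMemorySource()
                .add("scan=1", 100.1, 200.2, 300.3)
                .add("scan=2", 150.5);
        System.out.println("Total peaks: " + totalPeaks(source));
    }
}
```

This is the bridge-building analogy in code: as long as every team builds to the agreed block size (the interface), the pieces fit.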


What is ms-data-core-api?

The ms-data-core-api is a free, open-source library for developing computational proteomics tools and pipelines. The Application Programming Interface, written in Java, enables rapid tool creation by providing a robust, pluggable programming interface and common data model. The data model is based on controlled vocabularies/ontologies and captures the whole range of data types included in common proteomics experimental workflows, going from spectra to peptide/protein identifications to quantitative results. 

The library contains readers for three of the most used Proteomics Standards Initiative standard file formats: mzML, mzIdentML, and mzTab. In addition to mzML, it also supports other common mass spectra data formats: dta, ms2, mgf, pkl, apl (text-based), mzXML and mzData (XML-based). Also, it can be used to read PRIDE XML, the original format used by the PRIDE database, one of the world-leading proteomics resources. Finally, we present a set of algorithms and tools whose implementation illustrates the simplicity of developing applications using the library.
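As a rough illustration of the spectra formats listed above, the sketch below sorts file names into text-based and XML-based formats by their extension. The class SpectraFormats and the idea of deciding by extension are assumptions made for this example only; this is not how the library itself recognises a file.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of ms-data-core-api.
final class SpectraFormats {
    enum Kind { TEXT, XML, UNKNOWN }

    // Extensions taken from the formats named in this post (lower-cased).
    private static final Map<String, Kind> BY_EXTENSION = Map.of(
            "dta", Kind.TEXT,
            "ms2", Kind.TEXT,
            "mgf", Kind.TEXT,
            "pkl", Kind.TEXT,
            "apl", Kind.TEXT,
            "mzxml", Kind.XML,
            "mzdata", Kind.XML,
            "mzml", Kind.XML);

    // Classify a file name by its extension only (an assumption of this sketch).
    static Kind classify(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Kind.UNKNOWN;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, Kind.UNKNOWN);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"run01.mgf", "run01.mzML", "notes.txt"}) {
            System.out.println(name + " -> " + classify(name));
        }
    }
}
```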


How can I use it to analyse my data?

The library can be used to generate customised reports and to run different analyses on MS-based results. Here are some simple examples:

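import java.io.File;
import java.util.ArrayList;
import java.util.List;
// MzIdentMLControllerImpl and Spectrum must also be imported from ms-data-core-api.

/** Open the mzIdentML file (inputFile is a java.io.File pointing to it) **/
MzIdentMLControllerImpl mzIdentMlController = new MzIdentMLControllerImpl(inputFile, true);

/** List the spectra files (filems is a java.io.File pointing to the MS file) **/
List<File> fileList = new ArrayList<File>();
fileList.add(filems);

/** Attach the MS files to the controller **/
mzIdentMlController.addMSController(fileList);

/** Read the spectra and print the number of peaks of each one **/
for (Comparable id : mzIdentMlController.getSpectrumIds()) {
    Spectrum spectrum = mzIdentMlController.getSpectrumById(id);
    if (spectrum != null) {
        System.out.println("Spectrum id: " + id + " Number of Peaks: " + spectrum.getMassIntensityMap().length);
    }
}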

/** Print the peptide sequence of every peptide of every protein identification **/
// Requires: import java.util.Collection;
Collection<Comparable> proteinIds = mzIdentMlController.getProteinIds();
for (Comparable proteinID : proteinIds) {
    Collection<Comparable> peptideIds = mzIdentMlController.getPeptideIds(proteinID);
    for (Comparable peptideId : peptideIds) {
        System.out.println(mzIdentMlController.getPeptideSequence(proteinID, peptideId));
    }
}





The previous example lets the user retrieve the sequences of all identified peptides from an mzIdentML file.
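Once the peptide sequences have been collected by a loop like the one above, turning them into a customised report is plain Java. The sketch below is only an illustration: PeptideReport is a hypothetical class, and the map of protein ids to peptide sequences stands in for data you would fill from the controller. It counts the distinct peptide sequences per protein and writes tab-separated lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical report helper, not part of ms-data-core-api.
final class PeptideReport {

    // One header line, then one line per protein: id <TAB> number of distinct peptides.
    static List<String> summarise(Map<String, List<String>> peptidesByProtein) {
        List<String> lines = new ArrayList<>();
        lines.add("protein\tdistinct_peptides");
        for (Map.Entry<String, List<String>> entry : peptidesByProtein.entrySet()) {
            // A set drops repeated sequences (the same peptide identified several times).
            int distinct = new LinkedHashSet<>(entry.getValue()).size();
            lines.add(entry.getKey() + "\t" + distinct);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up stand-in data; in practice this would come from the controller loop.
        Map<String, List<String>> data = new LinkedHashMap<>();
        data.put("PROT_A", List.of("PEPTIDER", "PEPTIDER", "SAMPLEK"));
        data.put("PROT_B", List.of("ANOTHERK"));
        summarise(data).forEach(System.out::println);
    }
}
```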

Can I process the files using the API?  

The API provides different functionalities to process and analyse the results. The following example computes the isoelectric point of every peptide identified for each protein in the mzIdentML file:

/** Open the mzIdentML file **/
MzIdentMLControllerImpl mzIdentMlController = new MzIdentMLControllerImpl(inputFile, true);

/** Compute and print the isoelectric point of each identified peptide sequence **/
Collection<Comparable> proteinIds = mzIdentMlController.getProteinIds();
for (Comparable proteinID : proteinIds) {
    Collection<Comparable> peptideIds = mzIdentMlController.getPeptideIds(proteinID);
    for (Comparable peptideId : peptideIds) {
        System.out.println(IsoelectricPointUtils.calculate(mzIdentMlController.getPeptideSequence(proteinID, peptideId)));
    }
}
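For readers curious about what an isoelectric point calculation involves, here is a self-contained sketch of one common approach: sum the charges of the ionisable groups with the Henderson-Hasselbalch equation and search by bisection for the pH where the net charge is zero. This is not claimed to be what IsoelectricPointUtils does internally. The class name and the pKa values are assumptions chosen for illustration; published pKa sets differ slightly, so results will vary with the set used.

```java
// Hypothetical sketch, not the library's implementation.
final class IsoelectricPointSketch {
    // Illustrative pKa values (assumption); other published sets differ slightly.
    private static final double N_TERMINUS_PKA = 9.69;
    private static final double C_TERMINUS_PKA = 2.34;

    // Fraction of a basic group that carries a +1 charge at the given pH.
    private static double protonated(double pKa, double pH) {
        return 1.0 / (1.0 + Math.pow(10.0, pH - pKa));
    }

    // Fraction of an acidic group that carries a -1 charge at the given pH.
    private static double deprotonated(double pKa, double pH) {
        return 1.0 / (1.0 + Math.pow(10.0, pKa - pH));
    }

    // Net charge of the peptide: termini plus ionisable side chains.
    static double netCharge(String sequence, double pH) {
        double charge = protonated(N_TERMINUS_PKA, pH) - deprotonated(C_TERMINUS_PKA, pH);
        for (char aa : sequence.toCharArray()) {
            switch (aa) {
                case 'K': charge += protonated(10.5, pH); break;
                case 'R': charge += protonated(12.4, pH); break;
                case 'H': charge += protonated(6.0, pH); break;
                case 'D': charge -= deprotonated(3.86, pH); break;
                case 'E': charge -= deprotonated(4.25, pH); break;
                case 'C': charge -= deprotonated(8.33, pH); break;
                case 'Y': charge -= deprotonated(10.0, pH); break;
                default: break; // residues without an ionisable side chain in this sketch
            }
        }
        return charge;
    }

    // Bisection on pH 0..14; the net charge decreases as the pH increases.
    // Assumes the zero crossing lies inside that range.
    static double estimate(String sequence) {
        double low = 0.0;
        double high = 14.0;
        for (int i = 0; i < 60; i++) {
            double mid = (low + high) / 2.0;
            if (netCharge(sequence, mid) > 0) {
                low = mid;
            } else {
                high = mid;
            }
        }
        return (low + high) / 2.0;
    }

    public static void main(String[] args) {
        for (String peptide : new String[] {"PEPTIDEK", "KKKK", "DDDD"}) {
            System.out.printf("%s -> pI %.2f%n", peptide, estimate(peptide));
        }
    }
}
```

Lysine-rich peptides come out basic and aspartate-rich ones acidic, which is a quick sanity check for any pKa set you plug in.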




The present version includes several functionalities: protein inference, mass calculation and property calculation.
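To give an idea of what a mass calculation involves, the sketch below computes the monoisotopic mass of an unmodified peptide by adding the standard residue masses plus one water, and derives the m/z for a given charge state. PeptideMassSketch is a hypothetical class written for this post, not the library's own mass calculator, and it ignores modifications entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the library's implementation.
final class PeptideMassSketch {
    static final double WATER = 18.010565; // monoisotopic mass of H2O
    static final double PROTON = 1.007276; // mass of a proton

    // Monoisotopic residue masses of the 20 standard amino acids.
    private static final Map<Character, Double> RESIDUE = new HashMap<>();
    static {
        RESIDUE.put('G', 57.02146);
        RESIDUE.put('A', 71.03711);
        RESIDUE.put('S', 87.03203);
        RESIDUE.put('P', 97.05276);
        RESIDUE.put('V', 99.06841);
        RESIDUE.put('T', 101.04768);
        RESIDUE.put('C', 103.00919);
        RESIDUE.put('L', 113.08406);
        RESIDUE.put('I', 113.08406);
        RESIDUE.put('N', 114.04293);
        RESIDUE.put('D', 115.02694);
        RESIDUE.put('Q', 128.05858);
        RESIDUE.put('K', 128.09496);
        RESIDUE.put('E', 129.04259);
        RESIDUE.put('M', 131.04049);
        RESIDUE.put('H', 137.05891);
        RESIDUE.put('F', 147.06841);
        RESIDUE.put('R', 156.10111);
        RESIDUE.put('Y', 163.06333);
        RESIDUE.put('W', 186.07931);
    }

    // Neutral monoisotopic mass: sum of residues plus one water for the termini.
    static double monoisotopicMass(String sequence) {
        double mass = WATER;
        for (char aa : sequence.toCharArray()) {
            Double residue = RESIDUE.get(aa);
            if (residue == null) {
                throw new IllegalArgumentException("Unknown residue: " + aa);
            }
            mass += residue;
        }
        return mass;
    }

    // m/z of the peptide carrying the given number of protons (charge >= 1).
    static double mz(String sequence, int charge) {
        if (charge < 1) {
            throw new IllegalArgumentException("Charge must be at least 1");
        }
        return (monoisotopicMass(sequence) + charge * PROTON) / charge;
    }

    public static void main(String[] args) {
        String peptide = "PEPTIDE";
        System.out.printf("%s mass: %.4f, m/z at 2+: %.4f%n",
                peptide, monoisotopicMass(peptide), mz(peptide, 2));
    }
}
```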

Here is another example, showing how to compute protein inference.

Can it work for me, and can I contribute to the library?