Saturday, 17 March 2018

Big Data: not only a fancy/catchy name

The field of biomedical research has a new trend: using fancy terms in the titles of papers and grants in order to attract the attention of reviewers, journals, and funding agencies. Among others: large-scale, complete map, draft, landscape, deep, full, and Big Data. Figure 1 shows the exponential growth in the use of these words in PubMed articles.

Figure 1: Number of mentions of specific terms in PubMed by year.

I will stop here to discuss the term Big Data.

What is Big data?

First, let's go to Wikipedia, even though this is a "new" term: 

Big data is data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. There are five concepts associated with big data: volume, variety, velocity and, the recently added, veracity and value. (Wikipedia)  

The term Big Data has been used since the 1990s. Big data is associated with data beyond the ability of commonly used software tools to process within a tolerable elapsed time. Then, in my opinion: 
The term Big Data is not only about the size of the data, but also about the amount of time current software takes to process it, visualize it, and enable other users to query it.  
The mistake of associating the term big data with data size alone is common in the biomedical literature, and it should be seen as a misunderstanding not only of the term but also of what we are supposed to achieve by analyzing data in the future.

First, big data is not only about data size, and this is easily refutable: data on the order of one petabyte is considered big data today; soon it will be small data. In the same way, 1 terabyte of data was big 20 years ago and is considered small today.

Then, how can we contextualize the term and help readers understand and define the boundaries of Big Data?  

In 2001, Doug Laney introduced the 3Vs concept: Volume, Variety and Velocity. More recently, additional Vs have been proposed for addition to the model, including Variability -- the increase in the range of values typical of a large data set -- and Value, which addresses the need for valuation of enterprise data.

3Vs in a Diagram
Volume and Data size:  We currently see exponential growth in data storage, as data is now much more than text. We can find data in the format of videos, music and large images on our social media channels; in bioinformatics, RAW files from omics experiments, sequence variants and PTM profiles. It is very common for projects and experiments to require terabytes or even petabytes of storage. As the data grows, the applications and architecture built to support it need to be re-evaluated quite often. Sometimes the same data is re-analyzed from multiple angles, and even though the original data is the same, the newly found intelligence creates an explosion of data.

Velocity: With Velocity we refer to the speed at which data are being generated. Staying with our social media example: every day 900 million photos are uploaded to Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube and 3.5 billion searches are performed on Google. This is like a nuclear data explosion. Big Data technologies help a company absorb this explosion, accept the incoming flow of data and, at the same time, process it fast enough that it does not create bottlenecks.

Variety: Variety in Big Data refers to all the structured and unstructured data that can be generated either by humans or by machines. Variety is all about the ability to classify the incoming data into various categories.

Then, when someone in biomedical research uses the term, we need to be sure we are not talking only about the size of the data, but also about the way the data grows and the time it takes to process it (Velocity). For example, if someone says they have processed 10,000 genomes in one year, that is probably the largest study performed until that moment, but it is not a use of big data.

It is actually quite simple to spot whether a biomedical study is really using big data: look at the technologies the authors used. This is a list of terms that dominate the field of big data analytics nowadays, and we should know them as editors, reviewers and as a community: 

Apache tools for Big data: 

Flink: An open-source streaming data processing framework.
Hadoop: An open-source tool to process and store large distributed data sets across machines by using MapReduce.
Spark: An open-source big data processing engine that runs on top of Apache Hadoop, Mesos, or the cloud.
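
To make the MapReduce idea behind Hadoop concrete, here is a minimal single-machine sketch in plain Python (function names are my own, for illustration only): the Map stage emits (key, value) pairs, a shuffle groups them by key, and the Reduce stage aggregates each group. Real Hadoop distributes these stages across many machines and adds fault tolerance.

```python
# Minimal single-machine sketch of the MapReduce model (illustrative only).
from collections import defaultdict

def map_stage(documents):
    # Map: emit a (key, value) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_stage(pairs):
    # Shuffle: group all values by key (the framework does this for you).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_stage(grouped):
    # Reduce: apply an aggregation (here, sum) to each key's values.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data is not only size", "big data is also velocity"]
counts = reduce_stage(shuffle_stage(map_stage(docs)))
print(counts["big"])  # -> 2, counted across both documents
```

Spark exposes the same map/reduce style of chaining over its resilient distributed datasets, but keeps intermediate data in memory, which is why it is usually much faster than classic Hadoop MapReduce for iterative analyses.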

General Terms

Data lake: A storage repository that holds raw data in its native format.
Data processing: The process of retrieving, transforming, analyzing, or classifying information by a machine.
GPU-accelerated databases: Databases that use graphics processing units (GPUs) to accelerate queries, often used to ingest and analyze streaming data.
MapReduce: A data processing model that filters and sorts data in the Map stage, then performs a function on that data and returns an output in the Reduce stage.
Real-time stream processing: A model for analyzing sequences of data by using machines in parallel, though with reduced functionality.
Resilient distributed dataset: The primary way that Apache Spark abstracts data, where data is stored across multiple machines in a fault-tolerant way.
Shard: An individual partition of a database.
Stream processing: The real-time processing of data. The data is processed continuously, concurrently, and record-by-record.
Cloud computing: Well, cloud computing has become ubiquitous, so it may not be needed here, but I included it for completeness' sake. It's essentially software and/or data hosted and running on remote servers and accessible from anywhere on the internet.
Distributed File System: Because big data is too large to store on a single system, a distributed file system is a data storage system designed to store large volumes of data across multiple storage devices, helping to decrease the cost and complexity of storing large amounts of data.
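
To illustrate the "shard" entry above, here is a minimal sketch of hash-based sharding, a common way a database decides which partition a record lives on (plain Python; the function and shard count are my own illustrative choices):

```python
# Minimal sketch of hash-based sharding: each record key is assigned to
# one of N partitions (shards) so the data set can be split across machines.
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Hash the key and map it onto a shard. A stable hash (md5 here) keeps
    # the assignment consistent across runs and across machines.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Route a few sample record IDs to their shards.
shards = {i: [] for i in range(NUM_SHARDS)}
for record_id in ["sample-001", "sample-002", "sample-003", "sample-004"]:
    shards[shard_for(record_id)].append(record_id)

print(sum(len(ids) for ids in shards.values()))  # every record lands on exactly one shard
```

Because the assignment depends only on the key, any machine can compute which shard holds a given record without consulting a central index.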
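
Likewise, the "stream processing" entry can be sketched with Python generators: each record is handled as it arrives, record-by-record, without waiting for the whole data set. Real engines such as Flink add parallelism, windowing and fault tolerance on top of this idea (the operators below are my own illustrative names):

```python
# Minimal sketch of record-by-record stream processing with generators.
def source():
    # Stand-in for an unbounded stream (e.g. readings arriving from an instrument).
    for value in [3, 8, 1, 12, 7]:
        yield value

def filter_stream(stream, threshold):
    # Operator 1: drop records below a quality threshold as they arrive.
    for value in stream:
        if value >= threshold:
            yield value

def running_sum(stream):
    # Operator 2: keep state across records (a running total).
    total = 0
    for value in stream:
        total += value
        yield total

results = list(running_sum(filter_stream(source(), threshold=5)))
print(results)  # [8, 20, 27]
```

Note that nothing is computed until a record flows through the whole pipeline; this is the same dataflow style that stream-processing frameworks expose at cluster scale.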

I have reviewed some of the papers in PubMed that use the term big data in the title, and most of them use it to refer to the combination of multiple resources. For example, the field of interaction and pathway analysis is fond of using the term when combining multiple sources of data ... wrong. Interestingly, this review makes no mention of any of the terms/technologies related to big data, which demonstrates a big misconception of the concept. It is actually telling that Figure 1 shows no increase in the use of the term Hadoop in biomedical research.  

I have only a couple of examples of tools and algorithms in bioinformatics designed for big data analytics (studies and tools that can be classified as big data resources): 

ADAM: A library and command-line tool that enables the use of Apache Spark to parallelize genomic data analysis across cluster/cloud computing environments. ADAM uses a set of schemas to describe genomic sequences, reads, variants/genotypes, and features, and can be used with data in legacy genomic file formats such as SAM/BAM/CRAM, BED/GFF3/GTF, and VCF, as well as data stored in the columnar Apache Parquet format. 

Hail: An open-source, scalable framework for exploring and analyzing genomic data, starting from genetic data in VCF, BGEN or PLINK format.

Other Good Examples: 

Spectra Cluster: The spectra-cluster-hadoop application is used to cluster massive amounts of mass spectra using Hadoop technology. 

SparkBlast: SparkBLAST is a parallelization of a sequence alignment application (BLAST) that employs cloud computing for the provisioning of computational resources and Apache Spark as the coordination framework.

Feel free to add more comments and tools here.