Resources for Learning Big Data

Big Data is a hot topic, and there are many job openings in the field. This post introduces some of the most important technologies related to Big Data. If you want to learn Big Data and find a job in the field, read the articles and books recommended here.

The first key concept of Big Data is MapReduce, which is the core of Hadoop, an open-source framework that has become the foundation of the Big Data ecosystem.

  • MapReduce was invented by Google. It’s a paradigm for writing distributed programs, inspired by some elements of functional programming: a map step turns each input record into key-value pairs, and a reduce step combines all the values that share a key. Please refer to this article if you do not understand MapReduce. The Google internal implementation is called MapReduce, and Hadoop is its open-source implementation. Amazon’s hosted Hadoop service is called Elastic MapReduce (EMR) and has plugins for multiple languages.
  • HDFS (Hadoop Distributed File System) is an implementation inspired by the Google File System (GFS). It stores data across many machines when the data is too big for one. Hadoop reads its input from HDFS.
  • Apache Spark is an emerging platform that has more flexibility than MapReduce but more structure than a basic message-passing interface. It relies on the concept of distributed data structures (which it calls RDDs, resilient distributed datasets) and operators on them. See this page for more: The Apache Software Foundation
  • Google BigTable and its open-source twin HBase were meant to be read-write distributed databases that sit on top of GFS/HDFS and MapReduce/Hadoop. BigTable was originally built for the Google crawler. The research paper: BigTable
  • It is often hard to write MapReduce programs, especially if you are not an experienced programmer. Hive and Pig are abstractions on top of Hadoop that let you write SQL-like programs to analyze large-scale data sets more easily. Hive and Pig programs are translated into MapReduce jobs that run on top of Hadoop.
  • Oozie is a workflow scheduler. Simply put, you can use Oozie to build a pipeline that puts all the MapReduce, Hive, Pig, and even raw Java programs together. Say you have a large set of data stored in HDFS. You use Pig to load the data and transform it, then store the result in a Hive table, which is read by a Hive job that conducts further analysis. The output of the Hive job is then used as the input to train a model with Spark MLlib. You can use Oozie to define the dependency graph of these jobs in a configuration file, so that the jobs run one after another in the configured order.
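
To make the MapReduce idea from the list above more concrete, here is a minimal single-machine sketch of the classic word count in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration. A real framework would run the map and reduce steps on many machines and do the grouping step for you.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values by key. A real framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all the values seen for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data is everywhere"]
    print(word_count(docs))
```

Because each call to map_phase looks at only one record and each call to reduce_phase looks at only one key, both steps can be spread across machines without the pieces talking to each other. That is the property that makes the paradigm scale.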

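The Oozie pipeline described above is essentially a dependency graph of jobs. The sketch below shows that idea in plain Python using the standard library's graphlib module (Python 3.9+). It is not Oozie's configuration format (Oozie workflows are written in XML), and the job names are hypothetical labels for the Pig, Hive, and Spark steps in the example.

```python
from graphlib import TopologicalSorter

# Hypothetical job names. Each job maps to the set of jobs it depends on.
PIPELINE = {
    "pig_transform": set(),                   # load data from HDFS and transform it
    "hive_load": {"pig_transform"},           # store the result in a Hive table
    "hive_analysis": {"hive_load"},           # Hive job that reads the table
    "spark_mllib_train": {"hive_analysis"},   # train a model on the Hive output
}


def run_order(pipeline):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for job in run_order(PIPELINE):
        print("run", job)
```

A workflow scheduler does more than compute this order: it also launches the jobs and tracks their status. The dependency graph is the part you describe in the configuration.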

Here are some books worth reading:

  1. Hadoop: The Definitive Guide
  2. Learning Spark: Lightning-Fast Big Data Analysis
  3. Data-Intensive Text Processing with MapReduce
  4. Programming Pig
  5. Programming Hive