Resources for Learning Big Data
Big Data is a hot topic, and demand for engineers with Big Data skills is high. This post introduces some of the most important techniques in the Big Data ecosystem; if you want to learn Big Data and find a job in the field, please read the articles and books recommended here.
- MapReduce was invented by Google. It is a paradigm for writing distributed programs, inspired by elements of functional programming. Please refer to this article if you do not understand MapReduce. Google's internal implementation is called MapReduce, and Hadoop is its open-source counterpart. Amazon's hosted Hadoop service is called Elastic MapReduce (EMR) and has plugins for multiple languages.
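The paradigm is easiest to see in the classic word-count example. Below is a minimal single-machine sketch in plain Python: the `map_phase` and `reduce_phase` functions play the role of the user-supplied Mapper and Reducer in Hadoop, and the shuffle step that the framework normally performs between them is simulated with a dictionary.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word, like a Hadoop Mapper.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Group values by key; in Hadoop the framework does this between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(key, values):
    # Sum the counts for one word, like a Hadoop Reducer.
    return (key, sum(values))

documents = ["big data is big", "data is data"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs))
```

In a real cluster, the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; the user only writes the two functions.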
- HDFS (the Hadoop Distributed File System) is an open-source implementation inspired by the Google File System (GFS), built to store data across many machines when it is too big for one. Hadoop jobs consume data stored in HDFS.
- Apache Spark is an emerging platform that offers more flexibility than MapReduce but more structure than a basic message-passing interface. It is built around distributed data structures, which it calls Resilient Distributed Datasets (RDDs), and operators over them. See this page for more:
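To give a feel for the RDD operator style, here is a toy, single-machine imitation in plain Python. The `ToyRDD` class and its method names mirror a few of Spark's RDD operators (`map`, `filter`, `reduceByKey`, `collect`), but this is only an illustrative sketch; real Spark code would create RDDs through a `SparkContext` and execute each operator across a cluster.

```python
class ToyRDD:
    """A local, eager stand-in for Spark's distributed RDD."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        # Apply f to every element, yielding a new dataset.
        return ToyRDD(f(x) for x in self.data)

    def filter(self, f):
        # Keep only elements for which f returns True.
        return ToyRDD(x for x in self.data if f(x))

    def reduceByKey(self, f):
        # Combine all values sharing a key with the function f.
        groups = {}
        for k, v in self.data:
            groups[k] = f(groups[k], v) if k in groups else v
        return ToyRDD(groups.items())

    def collect(self):
        # Materialize the dataset as a plain list.
        return list(self.data)

rdd = ToyRDD(["spark", "hadoop", "spark", "hive"])
result = (rdd.map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b)
             .filter(lambda kv: kv[1] > 1)
             .collect())
```

The chained-operator style is the point: each call returns a new dataset, which is what lets Spark build and optimize a whole pipeline before running it.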
- Google BigTable and its open-source counterpart HBase are read-write distributed databases that sit on top of GFS/HDFS and MapReduce/Hadoop; BigTable was originally built for the Google crawler. The research paper:
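The BigTable paper describes the data model as a sparse, distributed, persistent, multidimensional sorted map indexed by row key, column, and timestamp. A nested Python dictionary can illustrate the shape of that map (a real HBase table is distributed and persistent; this sketch is neither, and the row and column names are made up):

```python
table = {}

def put(row, column, timestamp, value):
    # Store a versioned cell under (row, column family:qualifier, timestamp).
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    # Return the value with the highest timestamp, as HBase does by default.
    versions = table[row][column]
    return versions[max(versions)]

put("com.example/index.html", "contents:html", 1, "<html>v1</html>")
put("com.example/index.html", "contents:html", 2, "<html>v2</html>")
latest = get_latest("com.example/index.html", "contents:html")
```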
- It is often hard to write MapReduce programs by hand. Hive and Pig are abstractions on top of Hadoop that make it easier to analyze large-scale data sets: Hive offers an SQL-like query language (HiveQL), while Pig offers a dataflow language (Pig Latin). Hive and Pig programs are translated into MapReduce jobs that run on top of Hadoop.
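As an illustration of what this translation buys you: a one-line HiveQL aggregation such as `SELECT word, COUNT(*) FROM words GROUP BY word;` is compiled by Hive into a MapReduce group-and-count job, so you never write the Mapper or Reducer yourself. The snippet below computes the same aggregation in plain Python over a list standing in for a hypothetical `words` table:

```python
from collections import Counter

# Stand-in for a Hive table named `words` (hypothetical).
words = ["hadoop", "hive", "hadoop", "pig"]

# Equivalent of: SELECT word, COUNT(*) FROM words GROUP BY word;
word_counts = dict(Counter(words))
```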
- Oozie is a workflow scheduler. Simply speaking, you can use Oozie to build a pipeline that chains MapReduce, Hive, Pig, and even raw Java programs together. Say you have a large data set stored in HDFS: you use Pig to load and transform the data, then store it in a Hive table, which is read by a Hive job for further analysis. The output of the Hive job is then used as input to train a model with Spark MLlib. You can define the dependency graph of these jobs in an Oozie configuration file, so that they run one after another according to that configuration.
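The configuration file mentioned above is an XML workflow definition. Below is a minimal sketch of the first two steps of such a pipeline (a Pig action followed by a Hive action); the action names, script paths, and `${jobTracker}`/`${nameNode}` placeholders are hypothetical. Each action names its successor via `<ok to="..."/>`, which is how the dependency chain is expressed.

```xml
<workflow-app name="example-pipeline" xmlns="uri:oozie:workflow:0.5">
  <start to="pig-transform"/>
  <action name="pig-transform">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>transform.pig</script>
    </pig>
    <ok to="hive-analysis"/>
    <error to="fail"/>
  </action>
  <action name="hive-analysis">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>analysis.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pipeline failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```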
- MapReduce Online
- MapReduce paper: https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean_html/
- Running Hadoop on Linux
- The WordCount example