What is MapReduce and how it works

We usually develop programs based on open-source MapReduce frameworks such as Hadoop, Apache Pig, Apache Hive, and Spark to solve Big Data problems. In this post, I will use an example to describe what MapReduce is and how it works. I hope this will make Big Data technologies such as Hadoop, Pig, Hive, and Spark easier to learn.

What is MapReduce?

MapReduce is a key technique of Big Data. It was invented by Google, and it is the heart of Hadoop.

It is a programming paradigm that allows engineers and scientists to build scalable systems that can run on hundreds or thousands of servers.

Understand MapReduce

The key to understanding the MapReduce concept is to think about how you would have many workers collaborate to accomplish a task that cannot be done by one person.

For instance, suppose you have 1 million items scattered across the floor of a room. You want to reorganize them and put them on shelves based on their codes or IDs. Clearly, you need many people to help you. Now suppose you have 500 people; how will you finish the task?

 There are two methods:
  1. Each person works independently, fetching items and putting them on the shelves. This method will not work well if the room is not spacious enough, as the people have to walk back and forth.
  2. Divide the people into two groups. Suppose we give each person an ID from 1 to 500. Everyone whose ID is larger than 100 is assigned to the first group and instructed to pick up items and put each one into a box chosen by the item's ID. The remaining 100 people are assigned to the second group; each of them is given a box number, and all the items in that box are handled by that person.

    We also need a driver, whose main job is to assign each box to the corresponding person in the second group. For instance, suppose everyone in the first group follows this rule: if an item's ID ends with the digits xy, the item is put into box xy. So an item with an ID like XXX123456 is put into box 56, and the person in the second group whose ID is 56 handles the items in box 56. In summary, we give each person in the first group 100 boxes (numbered 00 to 99), and their job is to put items into the corresponding boxes based on the last two digits of the items' IDs. The driver then rearranges the boxes so that boxes with the same number are put together, and the person in the second group whose ID matches the box number handles the items in them.
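As a small sketch, the routing rule the first group follows can be written as a one-line function (the two-digit rule is the one described above; the function name is mine):

```python
def box_for(item_id):
    """Route an item to one of 100 boxes using the last two digits
    of its ID, e.g. "XXX123456" -> box 56."""
    return int(str(item_id)[-2:])
```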

Now let’s assume the items are words, the room is a large text file with billions of pages, the people are computers, and the driver is a computer responsible for resource allocation and management. The task is to count the frequency of each word in the file.

It is straightforward to develop a word-counting program based on a hash table that runs on a single machine; however, this is too slow if the file is very large.
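For example, a minimal single-machine version might look like this (a sketch in Python; the function name and details are illustrative):

```python
from collections import Counter

def count_words(lines):
    """Single-machine word count over an iterable of text lines,
    using a hash table (Counter) to accumulate frequencies."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Usage: count_words(open("big_file.txt", encoding="utf-8"))
```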

If we adopt the second method described in the item-reorganization example above, we can develop an approach that scales to thousands of machines. Similarly, we divide the computers into two groups. Each computer in the first group reads a part of the file, assigns each word a number (e.g., using a hash function), and hands the word along with its number to the driver; the driver then delivers the word to the corresponding computer in the second group based on that number.

In the terminology of MapReduce, the computers in the first group are called Mappers, and the computers in the second group are called Reducers. The following figure demonstrates the idea of using MapReduce to do word counting:

MapReduce for word count

How MapReduce Works

Suppose the file contains the following text:
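For instance, a tiny hypothetical sample:

```text
data is big data
big data is the future
we love big data
```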

Suppose the file is divided into 3 parts (e.g., each part is one line), and we have two Mappers. Assume Mapper 1 handles part 1, and Mapper 2 handles parts 2 and 3. Each mapper reads its input data and emits a tuple like (word, 1) for every word it sees.

How Mapper works

A simple Mapper is like this:
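Here is a sketch in Python, in the style of a Hadoop Streaming mapper (the function names are illustrative):

```python
import sys

def map_line(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.strip().lower().split():
        yield (word, 1)

def run_mapper(stream=sys.stdin):
    # Hadoop Streaming convention: read raw text lines and write
    # tab-separated "word<TAB>1" records to stdout.
    for line in stream:
        for word, count in map_line(line):
            print(f"{word}\t{count}")
```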

The output from Mapper 1 will be:
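Assuming part 1 is the hypothetical line "data is big data":

```text
(data, 1)
(is, 1)
(big, 1)
(data, 1)
```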

The output from Mapper 2 will be:
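Assuming parts 2 and 3 are the hypothetical lines "big data is the future" and "we love big data":

```text
(big, 1)
(data, 1)
(is, 1)
(the, 1)
(future, 1)
(we, 1)
(love, 1)
(big, 1)
(data, 1)
```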

The output from the mappers is sent to the MapReduce framework (e.g., Hadoop), which sorts the items so that words with the same hash value are grouped together and sent to the same Reducer. This process is called the Shuffle.

How Reducer works

Suppose we have three Reducers numbered 0, 1, and 2, and the MapReduce framework sends word Wi to machine m if Hash(Wi) % 3 == m.
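A sketch of this routing rule in Python (crc32 stands in for Hash here, because Python's built-in hash() of strings changes between runs):

```python
import zlib

NUM_REDUCERS = 3  # matches the three Reducers in this example

def pick_reducer(word, num_reducers=NUM_REDUCERS):
    """Hash(Wi) % num_reducers routing: every occurrence of the same
    word lands on the same Reducer."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers
```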

Suppose Reducer 1 received the following items:
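Continuing the hypothetical example, suppose the words "data" and "is" both hash to Reducer 1. After the shuffle it would receive:

```text
(data, 1)
(data, 1)
(data, 1)
(data, 1)
(is, 1)
(is, 1)
```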

A Simple Reducer Program

Note that the MapReduce framework sorts the items so that items with the same key are next to each other. So the Reducer program can be like this:
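Here is a sketch of such a reducer in Python (Hadoop Streaming style; the function names are illustrative). It relies on the input pairs being sorted by key:

```python
import sys
from itertools import groupby

def reduce_pairs(pairs):
    """pairs: iterable of (word, count) tuples, sorted by word.
    Yields one (word, total) per distinct word."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

def run_reducer(stream=sys.stdin):
    # Hadoop Streaming convention: the framework delivers the shuffled,
    # sorted "word<TAB>count" lines; print one total per word.
    split = (line.rstrip("\n").split("\t") for line in stream)
    for word, total in reduce_pairs((w, int(c)) for w, c in split):
        print(f"{word}\t{total}")
```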

The outcome will be:
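For the hypothetical example, in which Reducer 1 receives four ("data", 1) items and two ("is", 1) items:

```text
(data, 4)
(is, 2)
```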


You may observe that the word “data” occurs twice in Mapper 1. Sending two identical items to the MapReduce framework is unnecessary in our case, and it can be harmful if there are many duplicated items in the same mapper.

We can reduce the data transferred over the network by combining the items with the same key, which means that for the word "data", the generated output item from Mapper 1 will be ("data", 2).

In practice, it is always a good idea to take advantage of a combiner to reduce the data passed over the network.
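A sketch of mapper-side combining in Python (illustrative; real frameworks such as Hadoop instead let you plug in a separate combiner that runs between the map and shuffle steps):

```python
from collections import Counter

def map_line_with_combiner(line):
    """Map one line to (word, count) pairs, combining duplicate words
    locally before anything is sent over the network, so a line with
    "data" twice yields ("data", 2) once instead of ("data", 1) twice."""
    return sorted(Counter(line.strip().lower().split()).items())
```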