Category Archives: Map Reduce

Map Reduce Shuffle Sort Phase (Part 2)

This post is in the continuation with the Map Reduce Shuffle Sot Phase (Part 1),so it is clear that the mapper passes (K,V) key/value pair to reducer since sorting,grouping and
partitioning took place in Sort Shuffle Phase the basic requirement is
to have a way to compare the keys in order to sort,partition and group
them for this purpose the key class must implement
WritableComparable interface and value class should
be of a Writable type.

The method that is being called is

public int compareTo(WritableComparable o){}

Based on the result of this method sorting is done by default the sorting is done on compareTo method of the key class

Other than sorting, shuffle-sort phase is also used to distribute the mapper values amongst different reducers. This is done by partitioning the key values. Partitioner decides which {K, V} pair should be assigned to which reducer for processing. It is implemented by using a subclass of the Partitioner class. This class has a method:

public int getPartitions(K key, V value,int numReduceTasks) {}

While partitioning decides the reduce task, grouping decides which {K, V} pairs should be clubbed up in one reduce call. Each reduce task’s data is divided into various groups and reduce () method is called for each group (multiple reduce method calls may be made within one reduce task. Number of groups = number of reduce calls)

To make surethat the same keys are grouped in one group,The group comparator class should extend from WritableComparator class and override its compareTo() method. While grouping, {K, V} pairs are considered to be in the same group till the compareTo() call returns a 0. A new group is formed once a non-zero value is returned from the group comparator. This also implies that the keys should be sorted prior to grouping (which is already taken care in the sort phase); else grouping may emit incorrect results. By default, the key class’ compareTo() that was used for sorting is also used for group comparison.

Map Reduce Sort Shuffle Phase (Part 1)

Shuffle and Sort phase is a hand off process happens after the map completes and before reduce phase begins.

Data from mapper are moved to nodes where the reduce task will be run.When the mapper task completes the o/p is sorted and partitioned according to the number of reduce tasks defined and then written to disk.Data is made immediately available to reducers as soon as the output is available for a record from a map task rather than waiting for the last map task to complete,Although reducers will have lot of mapper tasks o/p in their memory but they cant execute it until the all mapper tasks are finished.Thus the processing speed will be controlled by the slowest mapper task,to avoid such a scenario Hadoop implements speculative execution concept

Cases where a mapper task is running slower than a reasonable amount of time the application master(Jobtracker ) will spawn a duplicate mapper tasks ,whichever task is finished first the o/p is stored on disk and the other is killed

The o/p’s are stored on local disk of the nodes where the mapper tasks were running not on HDFS