MapReduce Shuffle and Sort Phase (Part 1)

The shuffle and sort phase is a hand-off process that happens after map tasks complete and before the reduce phase begins.

Data from the mappers is moved to the nodes where the reduce tasks will run. When a mapper task completes, its output is partitioned according to the number of reduce tasks defined, sorted by key within each partition, and then written to disk. A map task's output is made available to the reducers as soon as that task completes, rather than waiting for the last map task to finish.

Although reducers may already hold the output of many mapper tasks in memory, they cannot start processing it until all mapper tasks have finished. The overall processing speed is therefore controlled by the slowest mapper task. To avoid this scenario, Hadoop implements speculative execution.
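The partition-and-sort step that each mapper applies to its output can be sketched roughly as below. This is only an illustration in plain Python, not Hadoop code: the names partition_for and partition_and_sort are made up for this example, and a CRC32 checksum stands in for the key's hash code (Hadoop's default HashPartitioner uses the key's hashCode() modulo the number of reduce tasks).

```python
import zlib


def partition_for(key, num_reducers):
    # Stand-in hash (assumption): CRC32 keeps the example reproducible across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_and_sort(map_output, num_reducers):
    # One bucket per reduce task; bucket i holds the pairs destined for reducer i.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition_for(key, num_reducers)].append((key, value))
    # Sort each bucket by key, as the mapper does before writing to local disk.
    return [sorted(bucket, key=lambda kv: kv[0]) for bucket in partitions]


map_output = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1)]
for reducer_id, bucket in enumerate(partition_and_sort(map_output, 2)):
    print(reducer_id, bucket)
```

Because the partition depends only on the key, every pair with the same key ends up at the same reducer, which is what lets a reducer see all values for a key together.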

When a mapper task runs much slower than expected, the application master (the JobTracker in MRv1) spawns a duplicate of that task. Whichever attempt finishes first has its output stored on disk, and the other attempt is killed.
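The effect of a duplicate attempt on job time can be seen in a toy calculation. Everything here is an assumption made for illustration: the function job_finish_time, the slow_factor threshold, and the rule "launch a backup once a task has run longer than slow_factor times the median duration" are hypothetical and are not how Hadoop actually estimates stragglers.

```python
from statistics import median


def job_finish_time(durations, backup_duration=None, slow_factor=1.5):
    # durations: seconds each original map attempt needs to finish.
    # backup_duration: seconds a duplicate attempt needs; None disables speculation.
    launch_at = slow_factor * median(durations)  # hypothetical straggler threshold
    finish_times = []
    for d in durations:
        if backup_duration is not None and d > launch_at:
            # Original and duplicate race; the first to finish wins, the other is killed.
            finish_times.append(min(d, launch_at + backup_duration))
        else:
            finish_times.append(d)
    # Reducers cannot start until the last map task is done.
    return max(finish_times)


durations = [10, 12, 11, 60]  # one straggler
print(job_finish_time(durations))                      # 60
print(job_finish_time(durations, backup_duration=10))  # 27.25
```

In this made-up run the straggler would hold the map phase to 60 seconds, while a backup attempt launched at 17.25 seconds lets it finish at 27.25 seconds.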

The map outputs are stored on the local disk of the nodes where the mapper tasks ran, not on HDFS.