Hadoop Map Side Joins

Joins performed on the mapper side are known as Map Side joins, while joins performed on the reducer side are known as Reduce Side joins.

Map Side join
*) The two data sets to be joined must be sorted on the same key
*) The two data sets to be joined must have the same number of partitions, and
all the records for any given key must be in the same partition
*) Out of the two data sets, one must be small enough to fit in memory

If the data sets are not sorted on the same key, we can run a preprocessing Hadoop job that just outputs the field on which
sorting is to be done as the key. By specifying the exact same number of reducers for all the data sets, we will have our data ready for
a Map Side join.
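
The effect of such a preprocessing job can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions: the class and method names (SortAndPartitionSketch, prepare, partitionFor) and the comma-separated sample records are hypothetical, and String keys stand in for Hadoop's Text keys. The partition formula mirrors the one used by Hadoop's default HashPartitioner, but a String hash is not the same as a Text hash, so the actual partition numbers in a real job would differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortAndPartitionSketch {

    // Same formula as Hadoop's default HashPartitioner, applied here to a String key.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Re-keys each comma-separated record on the given field, routes it to a
    // partition by key hash, and sorts every partition by key.
    // Each output line has the form "key<TAB>original record".
    static List<List<String>> prepare(List<String> records, int keyField, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String record : records) {
            String key = record.split(",")[keyField];
            partitions.get(partitionFor(key, numPartitions)).add(key + "\t" + record);
        }
        for (List<String> partition : partitions) {
            // Sort on the key part only (everything before the tab).
            partition.sort(Comparator.comparing((String line) -> line.substring(0, line.indexOf('\t'))));
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: orders carry the user id in field 1, users in field 0.
        List<String> orders = Arrays.asList("o1,u2,book", "o2,u1,pen", "o3,u2,ink");
        List<String> users = Arrays.asList("u2,Bob", "u1,Ann");
        System.out.println(prepare(orders, 1, 2));
        System.out.println(prepare(users, 0, 2));
    }
}
```

Because both data sets go through the same partition function with the same number of partitions, a given key ends up at the same partition index on both sides, which is the property the Map Side join relies on.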

While performing the Map Side join, the records are merged before they reach the mapper. We will use CompositeInputFormat with the following
configuration to specify:
1) The separator between the key and the value
2) The type of join to perform (inner, outer…)


Configuration config = new Configuration();
// Separator between the key and the value in each input line
config.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", separator);
// Join type (inner, outer...), input format and paths of the files to join
String joinExpression = CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, paths);
// Pass the join expression on to CompositeInputFormat
config.set("mapred.join.expr", joinExpression);

Once the join is done, the mapper is called. It receives the key as a Text and a TupleWritable composed of the values joined from our input files for that key.
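
To make the idea of merging sorted inputs before the mapper more concrete, here is a small plain Java sketch of a merge join over two key-sorted lists. It is not Hadoop's implementation and does not use TupleWritable: the class MergeJoinSketch, its join method, and the "key -> [left,right]" output format are hypothetical, and the sketch assumes that each key appears at most once per input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeJoinSketch {

    // Formats one joined record; an empty slot marks a side with no match.
    static String tuple(String key, String left, String right) {
        return key + " -> [" + left + "," + right + "]";
    }

    // Each row is {key, value}. Both inputs must already be sorted by key,
    // and each key is assumed to appear at most once per input.
    // outer == false gives an inner join, outer == true a full outer join.
    static List<String> join(List<String[]> left, List<String[]> right, boolean outer) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String[] l = left.get(i);
            String[] r = right.get(j);
            int cmp = l[0].compareTo(r[0]);
            if (cmp == 0) {            // key present on both sides
                out.add(tuple(l[0], l[1], r[1]));
                i++;
                j++;
            } else if (cmp < 0) {      // key only on the left side
                if (outer) out.add(tuple(l[0], l[1], ""));
                i++;
            } else {                   // key only on the right side
                if (outer) out.add(tuple(r[0], "", r[1]));
                j++;
            }
        }
        if (outer) {                   // drain whatever is left on either side
            for (; i < left.size(); i++) out.add(tuple(left.get(i)[0], left.get(i)[1], ""));
            for (; j < right.size(); j++) out.add(tuple(right.get(j)[0], "", right.get(j)[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, already sorted by key.
        List<String[]> users = Arrays.asList(
                new String[]{"u1", "Ann"}, new String[]{"u2", "Bob"}, new String[]{"u4", "Dee"});
        List<String[]> orders = Arrays.asList(
                new String[]{"u2", "book"}, new String[]{"u3", "pen"});
        System.out.println(join(users, orders, false));
        System.out.println(join(users, orders, true));
    }
}
```

Since both inputs are sorted on the same key, a single forward pass over each is enough, which is why the sorting and partitioning requirements matter for this kind of join.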