RDDs in Spark are partitioned; for key-value RDDs, HashPartitioner is the default. Co-partitioned RDDs use the same partitioner and therefore have their data distributed across partitions in the same way.
import org.apache.spark.HashPartitioner

val data = Array(1, 2, 3, 4, 5)
// keyBy(identity) turns the values into (key, value) pairs;
// partitionBy assigns an explicit HashPartitioner with 10 partitions
val rdd1 = sc.parallelize(data, 10).keyBy(identity).partitionBy(new HashPartitioner(10))
val data2 = Array(5, 8, 9, 10, 2)
val rdd2 = sc.parallelize(data2, 10).keyBy(identity).partitionBy(new HashPartitioner(10))
Both of the RDDs defined above use the same partitioner, HashPartitioner, with the same number of partitions. Because HashPartitioner computes the same hash for equal keys, equal keys in the two RDDs land in the same partition index. Such co-partitioned RDDs greatly reduce network shuffling: all the matching keys needed by key-based transformations already sit in corresponding partitions of the two RDDs.
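To make this concrete, here is a small spark-shell sketch (using the RDDs and key values from the arrays above) showing that equal keys map to the same partition index under the same HashPartitioner:

// Keys 2 and 5 appear in both RDDs; under HashPartitioner(10)
// each key hashes to the same partition index in rdd1 and rdd2.
val p = new HashPartitioner(10)
Seq(2, 5).foreach(k => println(s"key $k -> partition ${p.getPartition(k)}"))

// Both RDDs report the partitioner they were built with.
println(rdd1.partitioner) // Some(HashPartitioner)
println(rdd2.partitioner) // Some(HashPartitioner)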
Co-grouping builds on co-partitioning to join several RDDs efficiently in one step, instead of applying join repeatedly. With every join operation, the resulting RDD gets either the supplied or the default number of partitions, and the join may or may not shuffle the two input RDDs, depending on whether they were co-partitioned with the same number of partitions.
val rdd3 = rdd1.join(rdd2)
Since rdd1 and rdd2 use the same partitioner and have the same number of partitions, the join that produces rdd3 does not require any shuffle. If rdd1 and rdd2 had different numbers of partitions, the contents of the RDD with fewer partitions would have been reshuffled; and when the number of partitions is not specified, it falls back to the default configuration.
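This can be checked in the spark-shell; a quick sketch using the RDDs defined above:

// The join result keeps its parents' partitioner and partition count,
// so no extra shuffle stage is introduced for co-partitioned inputs.
println(rdd3.partitioner)       // Some(HashPartitioner)
println(rdd3.getNumPartitions)  // 10
rdd3.collect().foreach(println) // (2,(2,2)) and (5,(5,5)) for the data above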
Performing another join, say between rdd3 and a fourth RDD, to create rdd5 raises the chances of yet more shuffling. All of these repeated, expensive shuffles can be avoided by using cogroup when multiple RDDs have to be joined.
val rdd5 = rdd1.cogroup(rdd2, rdd3)
cogroup groups all of the RDDs by key in a single pass and produces a co-partitioned result, so the data can be combined without shuffling each intermediate join result again.
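For illustration, a minimal sketch of reading the cogrouped result (the element type follows from the definitions above):

// rdd5: RDD[(Int, (Iterable[Int], Iterable[Int], Iterable[(Int, Int)]))]
// Each key appears once, with all of its values from rdd1, rdd2 and rdd3
// grouped together; since the inputs are co-partitioned, no shuffle is needed.
rdd5.collect().foreach { case (k, (vs1, vs2, vs3)) =>
  println(s"key $k: rdd1=$vs1 rdd2=$vs2 rdd3=$vs3")
}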