The Apache Spark documentation defines fold as follows:
“Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral ‘zero value’.”
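Thus fold first folds the elements within each partition and then folds the per-partition results together.
Because the zero value is applied once in every partition and once more when the partition results are merged, the outcome is partition dependent.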
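To make the partition dependence concrete, here is a minimal sketch in plain Java, with no Spark involved, that mimics the two phases described above. The class FoldPartitionSketch, its fold helper and the sample numbers are made up for illustration; each inner list stands in for one partition, and real Spark runs the partitions as distributed tasks rather than in a loop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class FoldPartitionSketch {

    // Plain-Java stand-in for fold: each inner list plays the role of one partition.
    static <T> T fold(List<List<T>> partitions, T zero, BinaryOperator<T> op) {
        T merged = zero; // merging the partition results also starts from the zero value
        for (List<T> partition : partitions) {
            T partial = zero; // every partition starts from the zero value as well
            for (T element : partition) {
                partial = op.apply(partial, element);
            }
            merged = op.apply(merged, partial);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> one = Arrays.asList(Arrays.asList(1, 2, 3, 4));
        List<List<Integer>> two = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4));

        // Neutral zero value: the partition layout does not matter.
        System.out.println(fold(one, 0, Integer::sum)); // 10
        System.out.println(fold(two, 0, Integer::sum)); // 10

        // Non-neutral zero value: it is added once per partition plus once for the merge.
        System.out.println(fold(one, 1, Integer::sum)); // 12
        System.out.println(fold(two, 1, Integer::sum)); // 13
    }
}
```

With a zero value that is truly neutral for the function, the number of partitions makes no difference; with any other zero value, every extra partition contributes one more copy of it.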
Let’s examine the following Spark code
JavaRDD
String resultfold = jrdd.fold("0", new Function2<String, String, String>() {
    @Override
    public String call(String arg0, String arg1) throws Exception {
        // Print both arguments so that every call of the function can be traced
        System.out.println(" the arg0 .. >> " + arg0);
        System.out.println(" the arg1 .. >> " + arg1);
        return arg0 + "," + arg1;
    }
});
This gives the output
0,0,0,FirstFile
For the partition that holds the data, arg0 is initially assigned the zero value 0
and arg1 the element FirstFile,
resulting in 0,FirstFile as the aggregate for that partition
On a multi-core machine, when Spark is given more than one thread, the above RDD will have more than one partition. An empty partition folds down to the zero value alone, so when the result of an empty partition is merged we get the values below
arg0 is 0
arg1 is 0
result is 0,0
Once each partition is aggregated, the results of all the partitions are aggregated using the same function and the same zero value, resulting in
0,0,0,FirstFile
The above result may vary from run to run, depending on which partition finishes first
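The effect of the merge order can be sketched in plain Java without Spark. This is only an illustration under stated assumptions: the class FoldOrderSketch and its merge helper are hypothetical names, and the two partition results (one empty partition, one holding FirstFile) are written out by hand rather than computed by a cluster.

```java
import java.util.Arrays;
import java.util.List;

public class FoldOrderSketch {

    // The same combining function as in the example above.
    static String join(String arg0, String arg1) {
        return arg0 + "," + arg1;
    }

    // Merges already-folded partition results in the given order, starting from the zero value.
    static String merge(String zero, List<String> partitionResults) {
        String result = zero;
        for (String partial : partitionResults) {
            result = join(result, partial);
        }
        return result;
    }

    public static void main(String[] args) {
        String emptyPartition = "0";                   // an empty partition folds to the zero value
        String dataPartition = join("0", "FirstFile"); // "0,FirstFile"

        // Empty partition merged first, then the one with data.
        System.out.println(merge("0", Arrays.asList(emptyPartition, dataPartition))); // 0,0,0,FirstFile
        // Partition with data merged first, then the empty one.
        System.out.println(merge("0", Arrays.asList(dataPartition, emptyPartition))); // 0,0,FirstFile,0
    }
}
```

Because joining strings with a comma is not commutative and "0" is not a neutral element for it, the two merge orders give different strings.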
Let’s repartition the above RDD and explicitly set the number of partitions to 1
JavaRDD
String resultfold = jdpar.fold("0", new Function2<String, String, String>() {
    @Override
    public String call(String arg0, String arg1) throws Exception {
        // Print both arguments so that every call of the function can be traced
        System.out.println(" the arg0 .. >> " + arg0);
        System.out.println(" the arg1 .. >> " + arg1);
        return arg0 + "," + arg1;
    }
});
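Since there is a single partition holding the value FirstFile, the function is called twice
1st iteration (folding the elements of the partition)
arg0 0
arg1 FirstFile
2nd iteration (merging the partition result with the zero value)
arg0 0
arg1 0,FirstFile
result 0,0,FirstFile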