Understanding Apache Spark’s Fold operation

As described on Apache Spark's website, the definition of fold states:

"Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral 'zero value'."

Thus fold first folds the elements of each partition, and then folds the per-partition results together. The output is therefore partition dependent.
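These two steps can be sketched in plain Java, with no Spark dependency. The partitions are simulated here as lists (an assumption for illustration only); the helper names `FoldSketch` and `fold` are hypothetical:

import java.util.*;
import java.util.function.BinaryOperator;

public class FoldSketch {
    // Simulates RDD.fold: fold each partition starting from the zero value,
    // then fold the per-partition results, again starting from the zero value.
    static <T> T fold(List<List<T>> partitions, T zero, BinaryOperator<T> op) {
        T total = zero;
        for (List<T> partition : partitions) {
            T partial = zero;                  // each partition starts from zero
            for (T element : partition) {
                partial = op.apply(partial, element);
            }
            total = op.apply(total, partial);  // merge this partition's result
        }
        return total;
    }

    public static void main(String[] args) {
        // Two partitions: one empty, one holding "FirstFile"
        List<List<String>> parts = Arrays.asList(
                Collections.<String>emptyList(),
                Arrays.asList("FirstFile"));
        System.out.println(fold(parts, "0", (a, b) -> a + "," + b));
        // prints 0,0,0,FirstFile
    }
}

Note how the empty partition still contributes a "0": its fold produces just the zero value, which is then merged into the total. This mirrors the behavior examined below.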

Let's examine the code below:

JavaRDD<String> jrdd = jsc.parallelize(Arrays.asList("FirstFile"));
String resultfold = jrdd.fold("0", new Function2<String, String, String>() {

    public String call(String arg0, String arg1) throws Exception {
        System.out.println(" the arg0 .. >> " + arg0);
        System.out.println(" the arg1 .. >> " + arg1);
        return arg0 + "," + arg1;
    }
});

which gives the output:
0,0,0,FirstFile

arg0 is initially assigned the zero value 0 for a partition and
arg1 the value FirstFile,
resulting in 0,FirstFile as the aggregation for that partition.

On a machine with more than one core, when the number of threads given to Spark is greater than 1, the above RDD will have more than one partition. Since an empty partition folds down to zero elements, we get the below values for each empty partition:

arg0 is 0
arg1 is 0
result is 0,0

Once each partition is aggregated, the results of all the partitions are aggregated using the same function and the supplied zero value, resulting in:
0,0,0,FirstFile

The above result may vary each time we execute it, depending on which partition gets processed first.

Let's repartition the above RDD and set the partition count to 1 explicitly:

JavaRDD<String> jdpar = jrdd.repartition(1);
String resultfold = jdpar.fold("0", new Function2<String, String, String>() {

    public String call(String arg0, String arg1) throws Exception {
        System.out.println(" the arg0 .. >> " + arg0);
        System.out.println(" the arg1 .. >> " + arg1);
        return arg0 + "," + arg1;
    }
});

Since there is a single partition with the value FirstFile:

1st iteration (within the partition)
arg0: 0
arg1: FirstFile
result: 0,FirstFile

2nd iteration (merging the partition result with the zero value)
arg0: 0
arg1: 0,FirstFile
result: 0,0,FirstFile
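The single-partition trace above can be sketched in plain Java, with the partition simulated as a list (no Spark dependency; the class and method names are hypothetical):

import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class SinglePartitionFold {
    // Fold one partition starting from the zero value, then merge the
    // partition result with the zero value once more, mirroring the trace.
    static String fold(List<String> partition, String zero, BinaryOperator<String> op) {
        String partial = zero;
        for (String element : partition) {
            partial = op.apply(partial, element);   // 1st iteration: 0,FirstFile
        }
        return op.apply(zero, partial);             // 2nd iteration: 0,0,FirstFile
    }

    public static void main(String[] args) {
        System.out.println(fold(Arrays.asList("FirstFile"), "0", (a, b) -> a + "," + b));
        // prints 0,0,FirstFile
    }
}

With exactly one partition the output is deterministic, unlike the multi-partition case, because there is only one merge order.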