Understanding coalesce in Spark

Question

I have a question about coalesce. Its side effects are not quite clear to me. I have the following RDD:

JavaRDD<String> someStrings = //...
JavaRDD<String> coalescedStrings = someStrings.coalesce(100, false); // decreasing the number of partitions

So, what actually happened? If I operate on someStrings with some operations is it going to affect coalescedStrings?

Tzach Zohar · Accepted Answer

So, what actually happened?

First of all, since coalesce is a Spark transformation (and all transformations are lazy), nothing has happened yet. No data was read and no action on that data was taken. What did happen is that a new RDD (which is just a driver-side abstraction of distributed data) was created.

This new RDD is a set of instructions for reading and transforming data. It is identical to the set of instructions called someStrings, except that it contains one more "instruction": to merge the data into 100 partitions. Because shuffle is false, existing partitions are combined without a shuffle, so this only has an effect if someStrings has more than 100 partitions to begin with.

Actions and transformations on the new RDD (coalescedStrings) would use 100 partitions (which translates into 100 tasks per stage) for any processing, unlike operations on someStrings, which would use the original partition count. So the two RDDs would contain the same data (if operated upon), but partitioned differently.
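To make the "set of instructions" idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Spark: ToyRdd, source(), collectPartitions() and the sourceReads counter are hypothetical names invented for illustration, and the way partitions are grouped (contiguous parents merged together) is an assumption rather than Spark's actual placement logic. It only shows that a transformation builds a new recipe without running anything, and that the parent keeps its own partition count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: ToyRdd and source() are hypothetical names, not Spark's API.
class LazyCoalesceSketch {

    // Counts how often the "input" is actually read.
    public static int sourceReads = 0;

    /** A toy "RDD": just a recipe that produces a list of partitions on demand. */
    public static final class ToyRdd {
        private final Supplier<List<List<String>>> recipe;

        public ToyRdd(Supplier<List<List<String>>> recipe) {
            this.recipe = recipe;
        }

        /** Transformation: wraps the parent recipe in a new one and computes nothing. */
        public ToyRdd coalesce(int numPartitions) {
            if (numPartitions < 1) {
                throw new IllegalArgumentException("numPartitions must be positive");
            }
            return new ToyRdd(() -> {
                List<List<String>> parents = recipe.get();
                // Without a shuffle the partition count can only go down.
                int target = Math.min(numPartitions, parents.size());
                List<List<String>> merged = new ArrayList<>();
                for (int i = 0; i < target; i++) {
                    merged.add(new ArrayList<>());
                }
                // Assumed grouping: contiguous parent partitions are merged together.
                for (int i = 0; i < parents.size(); i++) {
                    merged.get(i * target / parents.size()).addAll(parents.get(i));
                }
                return merged;
            });
        }

        /** Action: the only place where the recipe actually runs. */
        public List<List<String>> collectPartitions() {
            return recipe.get();
        }
    }

    /** A toy source with 4 partitions of 2 strings each. */
    public static ToyRdd source() {
        return new ToyRdd(() -> {
            sourceReads++;
            List<List<String>> partitions = new ArrayList<>();
            for (int p = 0; p < 4; p++) {
                partitions.add(Arrays.asList("p" + p + "-a", "p" + p + "-b"));
            }
            return partitions;
        });
    }

    public static void main(String[] args) {
        ToyRdd someStrings = source();
        ToyRdd coalescedStrings = someStrings.coalesce(2);

        System.out.println(sourceReads);                                 // 0: nothing has run yet
        System.out.println(coalescedStrings.collectPartitions().size()); // 2
        System.out.println(someStrings.collectPartitions().size());      // 4: parent is unchanged
        System.out.println(sourceReads);                                 // 2: each action re-read the source
    }
}
```

Note that in this toy model each action re-runs the whole recipe from the source, which is the situation caching is meant to avoid.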

If I operate on someStrings with some operations is it going to affect coalescedStrings?

No, the two RDDs are completely* independent of each other - actions on one would not affect the other. someStrings still has its original number of partitions.

* This has some exceptions, mostly when it comes to caching. For example, if someStrings was cached at any stage of its calculation, and you operate on someStrings before you operate on coalescedStrings, then subsequent operations on coalescedStrings would be able to use the cached results and continue from there.
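The caching exception can be sketched the same way. Again, this is plain Java and not Spark: cached(), expensiveSource() and the computations counter are hypothetical names, and the memoizing Supplier only stands in for what a cached RDD does conceptually. The point is that once the parent has been computed and cached, a downstream recipe continues from the stored result instead of recomputing it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: these helpers are hypothetical and not part of Spark.
class CachedLineageSketch {

    // Counts how often the expensive source is actually computed.
    public static int computations = 0;

    /** Wraps a recipe so its result is computed once and reused afterwards. */
    public static <T> Supplier<T> cached(Supplier<T> recipe) {
        return new Supplier<T>() {
            private T value;
            private boolean done;

            @Override
            public synchronized T get() {
                if (!done) {
                    value = recipe.get();
                    done = true;
                }
                return value;
            }
        };
    }

    /** A recipe that records every time it really runs. */
    public static Supplier<List<String>> expensiveSource() {
        return () -> {
            computations++;
            return Arrays.asList("a", "b", "c");
        };
    }

    public static void main(String[] args) {
        Supplier<List<String>> someStrings = cached(expensiveSource());
        // Downstream recipe built on top of the cached parent.
        Supplier<List<String>> coalescedStrings = () -> new ArrayList<>(someStrings.get());

        someStrings.get();                // first action on the parent computes and caches it
        coalescedStrings.get();           // the child reuses the cached parent result
        System.out.println(computations); // 1
    }
}
```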
