St.Antario
St.Antario

Reputation: 27385

Understanding coalesce in Spark

I have a question about coalesce. It's not quite clear the side effect of it. I have the following RDD:

JavaRDD<String> someStrings = //...
JavaRDD<String> coalescedStrings = someStrings.coalesce(100, false); //descreasing

So, what actually happened? If I operate on someStrings with some operations is it going to affect coalescedStrings?

Upvotes: 2

Views: 5555

Answers (2)

Mahesh Chand
Mahesh Chand

Reputation: 3250

The coalesce method reduces the number of partitions in a DataFrame. It will not affect coalescedStrings whatever operation you operate on someStrings.

Upvotes: 1

Tzach Zohar
Tzach Zohar

Reputation: 37832

So, what actually happened?

First of all, since coalesce is a Spark transformation (and all transformations are lazy), nothing happened, yet. No data was read and no action on that data was taken. What did happen - a new RDD (which is just a driver-side abstraction of distributed data) was created. This new RDD is a set of instructions for reading / transforming data, which is identical to the set of instructions called someStrings, except it contains one more "instruction": to repartition the data into 100 partitions. Actions / transformation on that new RDD (coalescedStrings) would use 100 partitions (which would translate into 100 tasks per stage) to perform any processing, unlike operations on someStrings that would use the original partition count. So the two RDDs would contain the same data (if operated upon), but partitioned differently.

If I operate on someStrings with some operations is it going to affect coalescedStrings?

No, the two RDDs are completely* independent of each other - actions on one would not affect the other. someStrings still has its original number of partitions.

* this has some exceptions, mostly where it comes to caching: for example, if at any stage of its calculation, someStrings was cached, and you operate on someStrings before your operate on coalescedStrings - then subsequent operations on coalescedStrings would be able to use the cached results and continue from there.

Upvotes: 6

Related Questions