Reputation: 27385
I have a question about coalesce. It's not quite clear the side effect of it. I have the following RDD:
JavaRDD<String> someStrings = //...
JavaRDD<String> coalescedStrings = someStrings.coalesce(100, false); //descreasing
So, what actually happened? If I operate on someStrings
with some operations is it going to affect coalescedStrings
?
Upvotes: 2
Views: 5555
Reputation: 3250
The coalesce method reduces the number of partitions in a DataFrame. It will not affect coalescedStrings whatever operation you operate on someStrings.
Upvotes: 1
Reputation: 37832
So, what actually happened?
First of all, since coalesce
is a Spark transformation (and all transformations are lazy), nothing happened, yet. No data was read and no action on that data was taken. What did happen - a new RDD (which is just a driver-side abstraction of distributed data) was created. This new RDD is a set of instructions for reading / transforming data, which is identical to the set of instructions called someStrings
, except it contains one more "instruction": to repartition the data into 100 partitions. Actions / transformation on that new RDD (coalescedStrings
) would use 100 partitions (which would translate into 100 tasks per stage) to perform any processing, unlike operations on someStrings
that would use the original partition count. So the two RDDs would contain the same data (if operated upon), but partitioned differently.
If I operate on
someStrings
with some operations is it going to affectcoalescedStrings
?
No, the two RDDs are completely* independent of each other - actions on one would not affect the other. someStrings
still has its original number of partitions.
* this has some exceptions, mostly where it comes to caching: for example, if at any stage of its calculation, someStrings
was cached, and you operate on someStrings
before your operate on coalescedStrings
- then subsequent operations on coalescedStrings
would be able to use the cached results and continue from there.
Upvotes: 6