Azrael

Reputation: 700

Does Spark handle data shuffling?

I have an input A, which I convert into an RDD X spread across the cluster.

I perform certain operations on it.

Then I call .repartition(1) on the output RDD.

Will my output RDD be in the same order as input A?

Does Spark handle this automatically? If so, how?

Upvotes: 0

Views: 77

Answers (1)

Alexey Romanov

Reputation: 170713

The documentation doesn't guarantee that the order will be kept, so you should assume it won't be. If you look at the implementation, you'll see it certainly won't be (unless your original RDD already has 1 partition for some reason): repartition calls coalesce(shuffle = true), which

"Distributes elements evenly across output partitions, starting from a random partition."
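
To make the behaviour concrete, here is a rough toy model in plain Python. It is not Spark code: toy_repartition, the sample rows and the randomised block arrival order are all assumptions made for illustration. It only mimics the "round-robin from a random starting partition" idea, and then shows one common workaround: tag each element with its original position before the shuffle and sort on that tag afterwards.

```python
import random


def toy_repartition(partitions, num_out, seed=None):
    """Toy model of a shuffling repartition; NOT Spark's implementation.

    Each input partition deals its elements round-robin over the output
    partitions, starting from a randomly chosen one. Input partitions are
    visited in random order, standing in for the unspecified order in
    which shuffled blocks arrive (an assumption of this model).
    """
    rng = random.Random(seed)
    out = [[] for _ in range(num_out)]
    arrival = list(range(len(partitions)))
    rng.shuffle(arrival)
    for p in arrival:
        position = rng.randrange(num_out)  # random starting partition
        for element in partitions[p]:
            out[position % num_out].append(element)
            position += 1
    return out


# Hypothetical input A, tagged with original positions before any shuffle.
a = [f"row{i}" for i in range(12)]
tagged = list(enumerate(a))

# X: the tagged data spread over three partitions.
x = [tagged[0:4], tagged[4:8], tagged[8:12]]

# Collapse to a single partition; the arrival order is not guaranteed.
single = toy_repartition(x, num_out=1, seed=7)[0]

# Restore the original order by sorting on the tag, then drop the tag.
restored = [value for _, value in sorted(single, key=lambda pair: pair[0])]
print(restored == a)
```

On a real RDD the same idea is usually expressed with zipWithIndex before the shuffle and sortBy on the index afterwards, so the ordering comes from an explicit sort rather than from anything repartition promises.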

Upvotes: 1
