Reputation: 557
This might be a duplicate of this question.
Since it was already answered, I am creating a new question for more visibility. If that is not the right way, I can move my question as a comment on the above question.
According to the answer, all data for a key resides in a single partition. But this answer from the Spark mailing list says otherwise: link to spark dev group
It is not necessary if you are using bucketing available in Spark 2.0. For partitioning, it is still necessary because we do not assume each partition is small, and as a result there is no guarantee all the records for a partition end up in a single Spark task partition.
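(For reference, the bucketing mentioned above is Spark's bucketBy writer API; a minimal sketch with a made-up table and column name, since bucketed tables have to be written with saveAsTable:)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bucketing-sketch").getOrCreate()
val df = spark.range(1000).withColumnRenamed("id", "key")

// All rows whose "key" hashes to the same bucket end up in the same bucket file.
df.write
  .bucketBy(16, "key")
  .sortBy("key")
  .saveAsTable("bucketed_table")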
Can someone please confirm whether all the data for a partition ends up in a single task partition or not? Any links to docs or source code would be very helpful.
Upvotes: 2
Views: 2390
Reputation: 330423
These two posts discuss different problems.
The developer list thread discusses the problem of reading data from a partitioned data source, i.e. one written with partitionBy:
val df: DataFrame = ???
df.write.partitionBy("foo").saveAsTable("some_table")
When reading data like this, a single on-disk "partition" can be loaded into multiple Spark partitions, so it cannot be used to optimize the execution plan.
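One way to see this (a sketch, assuming a SparkSession named spark and the table written above; spark_partition_id is only used here for inspection):
import org.apache.spark.sql.functions.{col, spark_partition_id}

val readBack = spark.table("some_table")

// If a value of "foo" has enough data to span several input splits, its rows
// show up under more than one task partition id here.
readBack
  .groupBy(col("foo"), spark_partition_id().as("task_partition"))
  .count()
  .orderBy(col("foo"))
  .show()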
In contrast, Does spark keep all elements of an RDD[K,V] for a particular key in a single partition after "groupByKey" even if the data for a key is very huge? discusses the problem of partitioning PairRDDs,
and it is not related to Spark SQL or the data loading process. When you use:
val rdd: RDD[(T, U)] = ???
val partitioner: Partitioner = ???
rdd.partitionBy(partitioner)
all values for a particular key will be shuffled to a single partition.
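A quick sketch to check this behaviour (the key/value data is made up; it assumes an existing SparkContext sc):
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))

// Each key maps to exactly one partition index, so all of its values are co-located.
partitioned
  .mapPartitionsWithIndex((idx, iter) => iter.map { case (k, _) => (k, idx) })
  .distinct()
  .collect()
  .foreach(println)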
Upvotes: 1