Reputation: 557
This might be a duplicate of this question.
Since it was already answered, I am creating a new question for more visibility. If that is not the right way, I can move my question as a comment on the above question.
According to the answer, all data for a key resides in a single partition. But this answer from the Spark mailing list says otherwise: link to spark dev group
It is not necessary if you are using bucketing available in Spark 2.0. For partitioning, it is still necessary because we do not assume each partition is small, and as a result there is no guarantee all the records for a partition end up in a single Spark task partition.
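(For reference, the bucketing mentioned above is Spark's bucketBy writer API; a minimal sketch with a made-up table and column name, since bucketed tables have to be written with saveAsTable:)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bucketing-sketch").getOrCreate()
val df = spark.range(1000).withColumnRenamed("id", "key")

// All rows whose "key" hashes to the same bucket end up in the same bucket file.
df.write
  .bucketBy(16, "key")
  .sortBy("key")
  .saveAsTable("bucketed_table")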
Can someone please confirm whether all the data for a partition ends up in a single task partition or not? Any links to docs or source code would be very helpful.
Upvotes: 2
Views: 2390
Reputation: 330423
These two posts discuss different problems.
The developer list thread discusses the problem of reading data from a partitioned data source, i.e. one written with partitionBy:
val df: DataFrame = ???
df.write.partitionBy("foo").saveAsTable("some_table")
When reading data like this, a single on-disk "partition" can be loaded into multiple Spark partitions, so it cannot be used to optimize the execution plan.
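One way to see this (a sketch, assuming a SparkSession named spark and the table written above; spark_partition_id is only used here for inspection):
import org.apache.spark.sql.functions.{col, spark_partition_id}

val readBack = spark.table("some_table")

// If a value of "foo" has enough data to span several input splits, its rows
// show up under more than one task partition id here.
readBack
  .groupBy(col("foo"), spark_partition_id().as("task_partition"))
  .count()
  .orderBy(col("foo"))
  .show()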
In contrast, Does spark keep all elements of an RDD[K,V] for a particular key in a single partition after "groupByKey" even if the data for a key is very huge? discusses the problem of partitioning PairRDDs,
and it is not related to Spark SQL or the data loading process. When you use:
val rdd: RDD[(T, U)] = ???
val partitioner: Partitioner = ???
rdd.partitionBy(partitioner)
all values for a particular key will be shuffled to a single partition.
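A quick sketch to check this behaviour (the key/value data is made up; it assumes an existing SparkContext sc):
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))

// Each key maps to exactly one partition index, so all of its values are co-located.
partitioned
  .mapPartitionsWithIndex((idx, iter) => iter.map { case (k, _) => (k, idx) })
  .distinct()
  .collect()
  .foreach(println)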
Upvotes: 1