Reputation: 21757
Spark v2.4, no Hive
Spark benefits from bucketBy in that it knows the DataFrame has the correct partitioning. What about sortBy?
spark.range(100, numPartitions=1).write.bucketBy(3, 'id').sortBy('id').saveAsTable('df')
# No need to `repartition`.
spark.table('df').repartition(3, 'id').explain()
# == Physical Plan ==
# *(1) FileScan parquet default.df[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 3 out of 3
# Still need to `sortWithinPartitions`.
spark.table('df').sortWithinPartitions('id').explain()
# == Physical Plan ==
# *(1) Sort [id#33620L ASC NULLS FIRST], false, 0
# +- *(1) FileScan parquet default.df[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 3 out of 3
So the additional repartition is omitted; however, the sortWithinPartitions is not. Is sortBy useful? Can we use sortBy to speed up table joins at all?
Upvotes: 3
Views: 1070
Reputation: 40380
Short answer: there is no benefit from sortBy in persistent tables (at the moment, at least).
Longer answer:
Although Spark can save a bucketed DataFrame into a Hive table, Spark and Hive do not implement the same bucketing semantics or operational specification.
First, the unit of bucketing differs between the two frameworks: a single file per bucket (Hive) vs. a collection of files per bucket (Spark).
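To make the file-layout difference concrete, here is a minimal sketch (the table name df_multi is hypothetical): in Spark, each write task emits its own file per bucket, so writing from several partitions leaves each bucket spread over multiple files.
# df_multi is a hypothetical table name used for illustration.
spark.range(100, numPartitions=4) \
    .write.bucketBy(3, 'id').sortBy('id').saveAsTable('df_multi')
# Each of the 4 write tasks may emit one file per bucket, so the table
# directory can hold up to 4 x 3 = 12 data files; Hive would produce
# exactly 3 files, one per bucket.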
Second, in Hive each bucket is sorted globally, which can optimize queries that read the data. In Spark, until https://issues.apache.org/jira/browse/SPARK-19256 is (hopefully) resolved, each file is individually sorted, but a bucket as a whole is not globally sorted.
Thus, since the sort is not global, there is no benefit from sortBy.
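The practical consequence can be checked with a join. A minimal sketch (the table names t1 and t2 are hypothetical), assuming each bucket spans multiple files as above:
# t1 and t2 are hypothetical table names used for illustration.
spark.range(100, numPartitions=4).write.bucketBy(3, 'id').sortBy('id').saveAsTable('t1')
spark.range(100, numPartitions=4).write.bucketBy(3, 'id').sortBy('id').saveAsTable('t2')
spark.table('t1').join(spark.table('t2'), 'id').explain()
# Expected plan: no Exchange on either side (bucketBy is honored), but a
# Sort on both sides of the SortMergeJoin, because the per-file ordering
# written by sortBy does not make the bucket globally sorted.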
I hope this answers your question.
Upvotes: 1