colinfang
colinfang

Reputation: 21757

Does Spark benefit from `sortBy` in persistent table?

Spark v2.4 no Hive

Spark benefit from bucketBy in a way that it knows the DataFrame has the correct partitioning. What about sortBy?

spark.range(100, numPartitions=1).write.bucketBy(3, 'id').sortBy('id').saveAsTable('df')

# No need to `repartition`.
spark.table('df').repartition(3, 'id').explain()
# == Physical Plan ==
# *(1) FileScan parquet default.df2[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, # SelectedBucketsCount: 3 out of 3

# Still need to `sortWithinPartitions`.
spark.table('df').sortWithinPartitions('id').explain()
# == Physical Plan ==
# *(1) Sort [id#33620L ASC NULLS FIRST], false, 0
# +- *(1) FileScan parquet default.df2[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 3 out of 3

So additional repartition is omitted. However, sortWithinPartitions is not. Is sortBy useful? Can we use sortBy to speed up table joining at all?

Upvotes: 3

Views: 1070

Answers (1)

eliasah
eliasah

Reputation: 40380

Short answer : There is no benefits from sortBy in persistent tables (at the moment at least).

Longer answer :

Spark and Hive do not implement the same semantics or the operational specifications when it comes to bucketing support, although Spark can save bucketed DataFrame into a Hive table.

First, the units of bucketing are different between both frameworks: single bucket file (hive) vs. collection of files per bucket (spark).

Second,

In Hive, each bucket is sorted globally which can optimize queries reading data.

In Spark and till this issue https://issues.apache.org/jira/browse/SPARK-19256 gets (hopefully) resolved, each file is individually sorted but the bucket as a whole is not globally sorted.

Thus, since sorting is not global, there is no benefits form sortBy.

I hope this answers your question.

Upvotes: 1

Related Questions