Aleksei Cherniaev

Reputation: 401

What is the difference between bucketBy and partitionBy in Spark?

For example, if I want to save a table, what is the difference between these two strategies?

bucketBy:

someDF.write.format("parquet")
      .bucketBy(4, "country")
      .mode(SaveMode.Overwrite)
      .saveAsTable("someTable")

partitionBy:

someDF.write.format("parquet")
      .partitionBy("country") // <-- here is the only difference
      .mode(SaveMode.Overwrite)
      .saveAsTable("someTable")

I guess that bucketBy in the first case creates 4 directories with countries, while partitionBy will create as many directories as there are unique values in the column "country". Is this understanding correct?

Upvotes: 11

Views: 9501

Answers (2)

Chris

Reputation: 1455

Some differences:

I guess that bucketBy in the first case creates 4 directories with countries, while partitionBy will create as many directories as there are unique values in the column "country". Is this understanding correct?

Yes, for partitionBy. However, bucketBy will not create directories; it will create 4 bucket files (Parquet by default).

Upvotes: 8

aladeen

Reputation: 307

Unlike bucketing in Apache Hive, Spark SQL creates bucket files according to both the number of buckets and the number of partitions of the DataFrame being written. In other words, the number of bucket files is the number of buckets multiplied by the number of task writers (one per partition).
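To make the arithmetic concrete, here is a minimal sketch of that rule in plain Scala. The object and method names are hypothetical, and I am treating the product as an upper bound, on the assumption that a task only writes a file for a bucket it actually has rows for.

```scala
// Hypothetical helper, not part of Spark: it only restates the rule above.
object BucketFileCount {
  // Upper bound on the number of bucket files produced by one write:
  // each write task may emit one file per bucket.
  def maxBucketFiles(numBuckets: Int, numWriteTasks: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    require(numWriteTasks > 0, "numWriteTasks must be positive")
    numBuckets * numWriteTasks
  }
}
```

So with bucketBy(4, "country") and a DataFrame of 200 partitions, BucketFileCount.maxBucketFiles(4, 200) gives 800 files at most, not 4.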

You could also use bucketBy together with partitionBy, in which case each partition (the last-level partition in the case of multilevel partitioning) will have 'n' buckets.
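A rough sketch of combining the two might look like the following. The "year" and "id" columns, the sample rows, the table name and the local master are all assumptions made for illustration; note that the bucketing column has to be different from the partition column.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionAndBucketSketch {
  // Writes a small, made-up DataFrame partitioned by "year" and bucketed by "country".
  def run(spark: SparkSession): Unit = {
    import spark.implicits._

    // Illustrative data only; the columns are assumptions, not from the question.
    val someDF = Seq(
      ("US", 2020, 1),
      ("DE", 2020, 2),
      ("US", 2021, 3),
      ("FR", 2021, 4)
    ).toDF("country", "year", "id")

    someDF.write.format("parquet")
      .partitionBy("year")      // one directory per distinct year
      .bucketBy(4, "country")   // rows hashed on country into 4 buckets within each directory
      .mode(SaveMode.Overwrite)
      .saveAsTable("someTablePartitionedAndBucketed") // hypothetical table name
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-and-bucket-sketch")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()
    try run(spark) finally spark.stop()
  }
}
```

With this layout I would expect a year=2020 and a year=2021 directory under the table location, each holding bucket files for country rather than further subdirectories.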

Upvotes: 0
