Reputation: 75
I have read quite a few articles on bucketing in Spark but still haven't been able to get a clear picture of it. What I have understood so far is that bucketing is like partitioning inside a partition, and it is used for columns with very high cardinality, which helps avoid the reshuffle operation.
Even in the Spark documentation I can't find enough explanation. Here is an example from the documentation:
peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
I am unable to understand how the number 42 is decided for bucketing. Kindly help me understand this. Also, any clearer explanation of bucketing would be great.
Upvotes: 6
Views: 2569
Reputation: 18053
42 is a joke, like "the answer to life, the universe and everything". It is just an example value, nothing more.
Spark bucketing is handy for ETL in Spark: Spark job A writes out the data for table t1 according to a bucketing definition, Spark job B does the same for table t2, and Spark job C then joins t1 and t2 using those bucketing definitions, avoiding shuffles (aka exchanges). That is the optimization.
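A minimal sketch of that pattern in Scala (table and column names t1, t2, id, v1, v2 are made up for illustration; the key point is that both sides use the same bucket count and bucket column):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketing-sketch")
  .enableHiveSupport()          // saveAsTable needs a metastore-backed catalog
  .getOrCreate()
import spark.implicits._

// "Job A": write t1 bucketed by the join key into 42 buckets
Seq((1, "a"), (2, "b")).toDF("id", "v1")
  .write.bucketBy(42, "id").sortBy("id").saveAsTable("t1")

// "Job B": write t2 with the same bucket count and key
Seq((1, "x"), (2, "y")).toDF("id", "v2")
  .write.bucketBy(42, "id").sortBy("id").saveAsTable("t2")

// "Job C": join the two bucketed tables; the physical plan should show
// a SortMergeJoin with no Exchange on either side
spark.table("t1").join(spark.table("t2"), "id").explain()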
There is no general formula. It depends on data volumes, available executors, etc. The main point is avoiding shuffling. As a guideline, the default parallelism for joins and aggregations (spark.sql.shuffle.partitions) is 200, so 200 or greater could be an approach, but again, how many resources do you have on your cluster?
But to satisfy your quest for knowledge, one could argue that 42 should really be set to the number of executors (with 1 core each) you have allocated to the Spark job/app, leaving aside the issue of skew.
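As a rough illustration only (the sizing heuristic below is mine, not an official rule), continuing from the sketch above, you could derive the bucket count from the default shuffle parallelism or the cores available to the app:

// How many partitions Spark would use for a shuffled join/aggregation (200 by default)
val defaultShufflePartitions = spark.conf.get("spark.sql.shuffle.partitions").toInt
// Total cores available to this application
val totalCores = spark.sparkContext.defaultParallelism

// e.g. pick whichever is larger, so every core has at least one bucket to work on
val numBuckets = math.max(defaultShufflePartitions, totalCores)

// df is a placeholder for whatever DataFrame you are writing out
// df.write.bucketBy(numBuckets, "id").sortBy("id").saveAsTable("t_bucketed")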
Upvotes: 1