Reputation: 837
When should we not use bucketing in Hive? What is the bottleneck of this technique?
Upvotes: 3
Views: 1672
Reputation: 21
You should not prefer bucketing when the cardinality of the field is not too high. In that case partitioning is more beneficial. Partitioning on multiple fields also gives you a directory hierarchy, with an order like (country, state, city), whereas bucketing on several columns (CLUSTERED BY does accept more than one) hashes them together into a single bucket number.
Upvotes: 0
Reputation: 439
Bucketing is a method of distributing data into files, where it would otherwise be unevenly distributed.
When to use bucketing: when we know that queries will use a column such as "customer_id", which is sequential or evenly distributed.
When not to use bucketing: when we know that most use cases of the table involve reading only a subset of the data.
For example: although we keep historical data, we only process the last 2 weeks of data to determine something. In this scenario we would partition by weekno.
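To make the "last 2 weeks" case concrete, here is a rough Python model of the two layouts. It is only a sketch and not how Hive stores data: the row shape, the 52 weeks, the 100 customers and the 8 buckets are all made-up assumptions, and buckets are modelled as customer_id mod num_buckets.

```python
from collections import defaultdict

def layout_by_partition(rows):
    # One "directory" per weekno value, like PARTITIONED BY (weekno).
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["weekno"]].append(row)
    return partitions

def layout_by_bucket(rows, num_buckets):
    # One "file" per bucket; assumes non-negative integer customer ids.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["customer_id"] % num_buckets].append(row)
    return buckets

# Hypothetical data: 52 weeks of history, 100 customers per week.
rows = [{"weekno": w, "customer_id": c} for w in range(1, 53) for c in range(100)]

partitions = layout_by_partition(rows)
buckets = layout_by_bucket(rows, 8)

# A query on the last 2 weeks only opens 2 partitions...
recent_weeks = {51, 52}
rows_read_partitioned = sum(len(partitions[w]) for w in recent_weeks)

# ...but every bucket mixes all weeks, so all of them must be scanned.
rows_read_bucketed = sum(len(bucket) for bucket in buckets.values())
```

In this model the partitioned layout reads only the two recent weeks, while the bucketed layout has to read every row and filter afterwards, because weekno plays no part in deciding which bucket a row lands in.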
Upvotes: 0
Reputation: 2923
I guess you don't have to use bucketing when you can't benefit from it. As far as I know, the main benefits of bucketing are more efficient sampling and map-side joins (see below). So if your table is small, or you don't need fast sampling and map-side joins, just don't use it, because you will need to remember to bucket your data before insertion, either manually or using set hive.enforce.bucketing = true;
There is no bottleneck; it's just one of the possible data layouts, which lets you gain an advantage in some situations.
Hive map-side join example (see more here):
If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join
SELECT a.key, a.value
FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter
set hive.optimize.bucketmapjoin = true
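To see why matching bucket counts make this work, here is a small Python model of the idea. It is an illustration under stated assumptions, not Hive's implementation: keys are non-negative integers that hash to themselves, buckets are numbered from 0 (the text above counts from 1), and only the case where A's bucket count is a multiple of B's is handled. All function names are made up for the sketch.

```python
def bucket_of(key, num_buckets):
    # Model assumption: an integer key hashes to itself.
    return key % num_buckets

def bucketize(rows, num_buckets):
    # Split (key, value) rows into num_buckets lists.
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets

def bucket_map_join(a_rows, b_rows, a_num_buckets, b_num_buckets):
    """Model of SELECT a.key, a.value FROM a JOIN b ON a.key = b.key."""
    if a_num_buckets % b_num_buckets != 0:
        raise ValueError("A's bucket count must be a multiple of B's")
    a_buckets = bucketize(a_rows, a_num_buckets)
    b_buckets = bucketize(b_rows, b_num_buckets)
    joined = []
    for i, a_bucket in enumerate(a_buckets):
        # The "mapper" for bucket i of A loads one bucket of B, not all of B.
        b_bucket = b_buckets[i % b_num_buckets]
        matches = {}
        for key, _ in b_bucket:
            matches[key] = matches.get(key, 0) + 1
        for key, value in a_bucket:
            # Emit one output row per matching row in B.
            joined.extend([(key, value)] * matches.get(key, 0))
    return sorted(joined)

# Hypothetical sample rows.
a_rows = [(1, "a1"), (2, "a2"), (5, "a5"), (8, "a8")]
b_rows = [(1, "b1"), (5, "b5"), (5, "b5x"), (9, "b9")]
result_4_4 = bucket_map_join(a_rows, b_rows, 4, 4)
result_4_2 = bucket_map_join(a_rows, b_rows, 4, 2)
```

The pairing i mod b_num_buckets is safe because any key with key mod 4 == i also satisfies key mod 2 == i mod 2, so a matching row in B can never sit in a bucket the mapper did not load.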
Update: consider data skew when bucketing.
The bucket number is calculated as hash_function(bucketing_column) mod num_buckets. If your bucketing column is of int type, then hash_int(i) == i (see more here). So if you have skewed values in that column, for example one value that appears much more often than the others, then many more rows will be placed in the corresponding bucket. You will end up with disproportionate buckets, which harms query speed. Hive has built-in tools to overcome data skew (see Skewed Tables), but I don't think you should use a column with skewed data for bucketing in the first place.
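The effect of skew on bucket sizes is easy to check with a little arithmetic. The Python sketch below models the rule above for a non-negative int column (hash equals the value, then mod num_buckets). The data is invented for illustration: 1000 evenly spread ids versus a column where the value 42 accounts for 700 of 1000 rows.

```python
from collections import Counter

def bucket_sizes(values, num_buckets):
    # bucket = hash(value) mod num_buckets, with hash(i) == i for ints.
    counts = Counter(v % num_buckets for v in values)
    return [counts.get(b, 0) for b in range(num_buckets)]

# Evenly distributed ids: every bucket gets the same share.
even_column = list(range(1000))

# Skewed column: one hot value (42) dominates the data.
skewed_column = [42] * 700 + list(range(300))

even_sizes = bucket_sizes(even_column, 4)      # [250, 250, 250, 250]
skewed_sizes = bucket_sizes(skewed_column, 4)  # [75, 75, 775, 75]
```

With the skewed column, whichever task handles the bucket holding the hot value does roughly ten times the work of the others, so the whole query waits on that one bucket.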
Upvotes: 1