HIVE - what are the use cases for a bucket join

Question

I can't seem to find any good use case for a bucket join in hive.
As i see it, When joining table A with table B :
A bucket join saves us the time of passing Table A to the reducers while loading Table B into the distributed cache and each mapper processes the corresponding bucket of Table B vs the bucket of Table A.

But, the loading of Table B into the distributed cache is done by a single task thus as the table get gets bigger this becomes a bottleneck.
So, If table B is small enough not to burden a single task its practically the same as doing a regular map-join with a small optimization.

On the other hand if table B can't fit into a single mapper has a whole, the process of reading it to the distributed cache could take a while.

Finally, it seems that the time to load table B into the distributed cache might be worth it because we don't need to pass the buckets of table A from the mappers to the reducers but this process shouldn't be too heavy unless table A is really big, because each mapper would read a single bucket that corresponds to a single reducer (the tables are bucketed by the join key) each reducer fetches 2 intermediate outputs (one for each table, not bad chance that the reducer is running on the same node as its corresponding mapper) and merges them and from this point the join is the same as in the mappers.

To conclude, I think the question is what costs more :

Loading a moderate size table into the distributed cache by a single task
Passing a lot of moderate (maybe big) size buckets from the mappers to the reducers (mostly locally) and merging 2 files - all done in parallel.

What do you think? Can someone find a good usage to bucket join?

HIVE - what are the use cases for a bucket join

Answers (1)

Related Questions