Reputation: 2893
I can't seem to find any good use case for a bucket join in Hive.
As I see it, when joining table A with table B:
A bucket join saves us the time of passing table A to the reducers: table B is loaded into the distributed cache, and each mapper processes one bucket of table A against the corresponding bucket of table B.
But the loading of table B into the distributed cache is done by a single task, so as the table gets bigger this becomes a bottleneck.
So if table B is small enough not to burden a single task, it's practically the same as doing a regular map join with a small optimization.
On the other hand, if table B can't fit into a single mapper as a whole, reading it into the distributed cache could take a while.
Finally, it seems that the time to load table B into the distributed cache might be worth it, because we don't need to pass the buckets of table A from the mappers to the reducers. But that shuffle shouldn't be too heavy unless table A is really big: each mapper would read a single bucket that corresponds to a single reducer (the tables are bucketed by the join key), and each reducer fetches two intermediate outputs (one for each table, with a decent chance that the reducer is running on the same node as its corresponding mapper) and merges them. From that point on, the join is the same as in the mappers.
To conclude, I think the question is what costs more:
What do you think? Can someone find a good use for a bucket join?
Upvotes: 0
Views: 2063
Reputation: 5223
I think you're confusing a bucket join with a map join. In a map join, the smaller table is loaded into the distributed cache, assuming it's small enough, and it is sent to all the mappers. There's a 1-to-N correspondence.
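To make the 1-to-N shape concrete, here is a minimal Python sketch of what a map join does conceptually. It is not Hive's implementation: `map_join`, `small_b` and `splits_of_a` are hypothetical names, and plain lists stand in for the tables and input splits.

```python
from collections import defaultdict

def map_join(big_split, small_table):
    """Join one split of the big table against the whole small table."""
    # Every mapper gets its own full copy of the small table (1 to N)
    # and builds an in-memory hash table from it.
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)
    # Stream the big-table split and probe the hash table; no reducers needed.
    return [
        (key, a_val, b_val)
        for key, a_val in big_split
        for b_val in lookup.get(key, [])
    ]

# Hypothetical data: table B is small, table A is read as two splits.
small_b = [(1, "x"), (2, "y")]
splits_of_a = [[(1, "a1"), (3, "a3")], [(2, "a2"), (1, "a4")]]

# Each "mapper" handles one split of A but sees all of B.
joined = [row for split in splits_of_a for row in map_join(split, small_b)]
```

The point of the sketch is that every mapper needs the entire small table in memory, which is why this only works when that table is small.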
In a bucket join, you're joining two large tables, both of which store the data in the same way: in N buckets (files), bucketed and sorted by the same column you're joining on. So table A has N buckets and table B has N buckets too, so you can merge-sort bucket #1 of A with bucket #1 of B, #2 with #2, etc. It's a 1-to-1 correspondence, N times. This is also done on the map side, but the distributed cache is not involved.
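The bucket-by-bucket merge can be sketched in plain Python as well. Again, this is only an illustration under assumptions: `bucketize` and `merge_join` are hypothetical helpers, `key % num_buckets` stands in for Hive's bucketing hash, and integer keys are assumed.

```python
def bucketize(rows, num_buckets):
    """Split rows into buckets by key, each bucket sorted by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        # Assumption: integer keys, modulo as the bucketing function.
        buckets[key % num_buckets].append((key, value))
    return [sorted(bucket) for bucket in buckets]

def merge_join(left, right):
    """Merge-join two lists of (key, value) pairs that are sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for _, r_val in right[j:j_end]:
                    out.append((lk, left[i][1], r_val))
                i += 1
            j = j_end
    return out

# Hypothetical data: both tables use the same bucket count and the same key.
table_a = [(4, "a4"), (1, "a1"), (2, "a2"), (3, "a3")]
table_b = [(3, "b3"), (1, "b1"), (4, "b4"), (5, "b5")]
N = 2
buckets_a = bucketize(table_a, N)
buckets_b = bucketize(table_b, N)

# Bucket i of A is only ever joined with bucket i of B (1 to 1, N times).
result = [row for a, b in zip(buckets_a, buckets_b) for row in merge_join(a, b)]
```

Each pair of buckets is read sequentially and never has to fit in memory as a whole, which is why neither table needs to be small.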
Upvotes: 0