Reputation: 137
I have a hive efficiency question. I have 2 massive queries that need to be filtered, joined with mapping tables, and unioned. All the joins are identical for both tables. Would it be more efficient to union them before applying the joins to the combined table or to apply the joins to each massive query individually then union the results? Does it make a difference?
I tried the second way and the query ran for 24 hours before I killed it. I feel like I did everything I could to optimize it except potentially rearrange the union statement. On the one hand, I feel like it should not matter because the number or rows being joined by the mapping table is the same and since everything is palatalized, it should take roughly the same amount of time. On the other hand, maybe by doing the union first, it should guarantee that the two big queries are given full system resources before the joins are run. Then again, that might mean that there are only 2 jobs running at a time so the system is not being fully used or something.
I simply do not know enough about how hive and it's multi-threading works. Anybody have any ideas?
Upvotes: 1
Views: 4217
Reputation: 38335
There is no such best practice. Both approaches are applicable. Subqueries in UNION ALL are running as parallel jobs. So join before union will work as parallel tasks with smaller datasets, tez can optimize execution and common joined tables will be read only once in single mapper stage for each table. Also you can avoid joins for some subqueries for example if their keys are not applicable for join.
Join with union-ed bigger dataset also may work with very high parallelism depending on your settings (bytes per reducer for example), optimizer also may rewrite query plan. So I suggest you to check both methods, measure speed, study plan and check if you can change something. Change, measure, study plan... repeat
Few more suggestions: Try to limit datasets before joining them. If your join multiplies rows then analytics and aggregation may work slower on bigger datasets and first approach may be preferable if you can apply analytics/aggregation before union.
Upvotes: 1