Reputation: 234
Consider the following two queries in Hive:
SELECT
*
FROM
A
INNER JOIN
B
INNER JOIN
C
ON
A.COL = B.COL
AND A.COL = C.COL
and
SELECT
*
FROM
A
INNER JOIN
B
ON
A.COL = B.COL
INNER JOIN
C
ON
A.COL = C.COL
Question: Are the two queries computationally same or different? In other words, to get the fastest results should I prefer to write one versus the other, or it doesn't matter? Thanks.
Upvotes: 1
Views: 207
Reputation: 38335
On Hive 1.2, also tested on Hive 2.3, both on Tez, the optimizer is intelligent enough to derive ON condition for join with table B and performs two INNER JOINs each with correct it's own ON condition.
Checked on simple query
with A as (
select stack(3,1,2,3) as id
),
B as (
select stack(3,1,2,3) as id
),
C as (
select stack(3,1,2,3) as id
)
select * from A
inner join B
inner join C
ON A.id = B.id AND A.id = C.id
Explain command shows that both joins are executed as map-join on single mapper and each join has it's own join condition. This is explain output:
Map 1 File Output Operator [FS_17] Map Join Operator [MAPJOIN_27] (rows=1 width=12) Conds:FIL_24.col0=RS_12.col0(Inner),FIL_24.col0=RS_14.col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2"]
First I thought that it will be CROSS join with table B in first query, then join with C will reduce the dataset, but both queries work the same(the same plan, the same execution), thanks to the optimizer.
Also I tested the same with map-join switched off (set hive.auto.convert.join=false;
) and also got exactly the same plan for both queries. I did not test it for really big tables, you better double-check.
So, computationally both are the same on Hive 1.2 and Hive 2.3 for map-join and merge join on reducer
Upvotes: 1