Comparing efficiency of Hive queries with different join orders

Question

Consider the following two queries in Hive:

SELECT
    *
FROM
    A
INNER JOIN 
    B
INNER JOIN
    C
ON 
    A.COL = B.COL
AND A.COL = C.COL

and

SELECT
    *
FROM
    A
INNER JOIN
    B
ON
    A.COL = B.COL
INNER JOIN
    C
ON
    A.COL = C.COL

Question: Are the two queries computationally same or different? In other words, to get the fastest results should I prefer to write one versus the other, or it doesn't matter? Thanks.

leftjoin · Accepted Answer

On Hive 1.2, also tested on Hive 2.3, both on Tez, the optimizer is intelligent enough to derive ON condition for join with table B and performs two INNER JOINs each with correct it's own ON condition.

Checked on simple query

with A as (
select stack(3,1,2,3) as id
),
B as (
select stack(3,1,2,3) as id
),
C as (
select stack(3,1,2,3) as id
)

select * from A 
inner join B
inner join C
ON A.id = B.id AND A.id = C.id

Explain command shows that both joins are executed as map-join on single mapper and each join has it's own join condition. This is explain output:

Map 1 File Output Operator [FS_17] Map Join Operator [MAPJOIN_27] (rows=1 width=12) Conds:FIL_24.col0=RS_12.col0(Inner),FIL_24.col0=RS_14.col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2"]

First I thought that it will be CROSS join with table B in first query, then join with C will reduce the dataset, but both queries work the same(the same plan, the same execution), thanks to the optimizer.

Also I tested the same with map-join switched off (set hive.auto.convert.join=false;) and also got exactly the same plan for both queries. I did not test it for really big tables, you better double-check.

So, computationally both are the same on Hive 1.2 and Hive 2.3 for map-join and merge join on reducer

Comparing efficiency of Hive queries with different join orders

Answers (1)

Related Questions