Maverick
Maverick

Reputation: 494

Optimized hive query needed

We have two tables containing huge data. And we have to perform inner join between these two tables. Joining condition is based on 3 columns c1,c2,c3.

Need your help in writing an optimized query for the above scenario. Please refer to the data model below.

table1

c1,c2,c3,c4,c5


table2

c1,c2,c3

Appreciate your help. Thanks.

Upvotes: 0

Views: 193

Answers (3)

Roshini
Roshini

Reputation: 723

An inner join will give you all the rows that exist in both table 1 and table 2 as well as any extra combinations that exist with the other columns present in table 2. A left semi join like @dimamah mentioned will give you only those rows that are present in the left table 1.

For the first solution where you should expect more rows

SELECT t1.* 
FROM table1 t1
INNER JOIN table2 t2
ON (t1.c1=t2.c1 AND t1.c1 = t2.c2 AND t1.c3 = t2.c3)

For the second solution where you will get the same number of rows as t1

SELECT t1.*
FROM table1 t1
LEFT SEMI JOIN table2 t2
ON (t1.c1=t2.c1 AND t1.c1 = t2.c2 AND t1.c3 = t2.c3)

Upvotes: 0

dimamah
dimamah

Reputation: 2903

as it seems you are not using any new data from table2 and the logic is "return all rows in table1 that exist in table2". This is done by using left semi join.
The number of records might increase if table2 is not uniqe (same combination of c1 c2 c3 appears more than once).

Upvotes: 1

Doctor Dan
Doctor Dan

Reputation: 771

If I understand Hortonworks' founder Owen O'Malley correctly, you may optimize it by writing

SELECT t1.* from table1 t1
JOIN (SELECT table2.* FROM table2) t2
ON (t1.c1=t2.c1 AND t1.c1 = t2.c2 AND t1.c3 = t2.c3)

Upvotes: 2

Related Questions