Charitra kocheri
Charitra kocheri

Reputation: 127

Hive join two tables by removing duplicates

I have two hive tables -

Table1:
col1 col2
---------
1     b
2     c
3     d
Table2:
col1 col2 col3
----------------
1     b     x
1     b     y
1     b     z
2     c     x
3     d     x

Here as you can see in Table 2 I have duplicate values for col1 and col2. I want to remove these duplicates when I do select. I dont care about col3.

I am looking for select query which will give me -

col1 col2 col3
----------------
1     b     y
2     c     x
3     d     x

I tried using different JOINs but end up getting duplicate values.

Upvotes: 2

Views: 2134

Answers (1)

Kaushik Nayak
Kaushik Nayak

Reputation: 31648

use GROUP BY along with JOIN

SELECT a.col1,
  a.col2,
  MAX(b.col3) AS col3
FROM Table1 a
JOIN Table2 b
ON a.col1 = b.col1
  AND a.col2 = b.col2
GROUP BY a.col1,
  a.col2;

I don't care about col3.

So, if you don't want it in output, you could simply use EXISTS

SELECT a.col1,
       a.col2
FROM   table1 a
WHERE  EXISTS (SELECT 1
               FROM   table2 b
               WHERE  a.col1 = b.col1);  

Upvotes: 2

Related Questions