Reputation: 44
I have two Hive tables: Table1, Table2. Table1 has duplicate rows, Table2 does not. I want to get the missing data from Table1 which is not present in Table2, including the duplicates. How can I get this done in Hive Query Language?
Example:
Table1 data:
Col1,Col2
A1,V1
A1,V1
A2,V2
A3,V3
A3,V3
A3,V3
A4,V4
Table2 data:
Col1,Col2
A1,V1
A2,V2
A3,V3
I want to get the following missing data from Table1:
Col1,Col2
A1,V1
A3,V3
A3,V3
A4,V4
Upvotes: 0
Views: 32
Reputation: 1733
you can use something like:
with t1 as (
select 'A1' col1,'V1' col2 union all
select 'A1' col1,'V1' col2 union all
select 'A2' col1,'V2' col2 union all
select 'A3' col1,'V3' col2 union all
select 'A3' col1,'V3' col2 union all
select 'A3' col1,'V3' col2 union all
select 'A4' col1,'V4' col2
),
t2 as (
select 'A1' col1,'V1' col2 union all
select 'A2' col1,'V2' col2 union all
select 'A3' col1,'V3' col2
),
t1_with_rn as (
select t1.*, row_number() over(partition by t1.col1, t1.col2) rn from t1
)
select
t1_with_rn.col1, t1_with_rn.col2
from
t1_with_rn
left join t2 on (t1_with_rn.col1 = t2.col1 and t1_with_rn.col2 = t2.col2 and t1_with_rn.rn = 1)
where
t2.col1 is null and t2.col2 is null
Upvotes: 1