Getting set difference with duplicate rows from Hive tables

Question

I have two Hive tables: Table1, Table2. Table1 has duplicate rows, Table2 does not. I want to get the missing data from Table1 which is not present in Table2, including the duplicates. How can I get this done in Hive Query Language?

Example:

Table1 data:

Col1,Col2
A1,V1
A1,V1
A2,V2
A3,V3
A3,V3
A3,V3
A4,V4

Table2 data:

Col1,Col2
A1,V1
A2,V2
A3,V3

I want to get the following missing data from Table1:

Col1,Col2
A1,V1
A3,V3
A3,V3
A4,V4

User9123 · Accepted Answer

you can use something like:

with t1 as (
  select 'A1' col1,'V1' col2 union all
  select 'A1' col1,'V1' col2 union all
  select 'A2' col1,'V2' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A4' col1,'V4' col2
),
t2 as (
  select 'A1' col1,'V1' col2 union all
  select 'A2' col1,'V2' col2 union all
  select 'A3' col1,'V3' col2
),
t1_with_rn as (
  select t1.*, row_number() over(partition by t1.col1, t1.col2) rn from t1
)
select 
  t1_with_rn.col1, t1_with_rn.col2
from 
  t1_with_rn
  left join t2 on (t1_with_rn.col1 = t2.col1 and t1_with_rn.col2 = t2.col2 and t1_with_rn.rn = 1)
where
  t2.col1 is null and t2.col2 is null

Getting set difference with duplicate rows from Hive tables

Answers (1)

Related Questions