Reputation: 1130
Below is my data flow. Say the RDBMS has 10 records.

Step 1:
RDBMS --> SQOOP --> HIVE (e.g. table A, initial load)

Step 2:
RDBMS (modified record {record 2}) --> SQOOP (incremental) --> HDFS --> temp HIVE table (table B) --> HIVE (table A)

While moving the data from table B to table A in step 2, I use the INSERT command in Hive in append mode, so table A now obviously contains duplicate records.

How do I get rid of this duplicate data? Is there a way? If yes, what is it? If not, what is the industry standard?
Upvotes: 0
Views: 637
Reputation: 512
Yes, you can use a staging table for this. I created two extra tables:

DBMS_old: for the old data that I want to keep in the final table but that is not present in the new result.
DBMS_new: for the new data that I want to append to the final table.
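Assuming the final table (here called DBMS) already exists, the two staging tables can be created with the same schema. This is a sketch with hypothetical table names matching the answer; Hive's CREATE TABLE ... LIKE copies the column layout:

```sql
-- Hypothetical names; adjust to your own schema.
CREATE TABLE DBMS_old LIKE DBMS;   -- staging table for retained old rows
CREATE TABLE DBMS_new LIKE DBMS;   -- staging table for the incremental batch
```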
Step 1: Load the old data into the staging table using a left outer join. This pulls into DBMS_old every record from the final table whose primary key does not appear in the new batch.
insert overwrite table DBMS_old
select final.* from DBMS final               -- DBMS is the final table
left outer join DBMS_new new
on final.primary_key = new.primary_key       -- join on the key only, so an updated row counts as present in the new batch
where new.primary_key is null;               -- keep only rows absent from the new batch
Step 2: Rebuild the final table by combining the retained old data and the new data with union all.
insert overwrite table DBMS
select * from
(select * from DBMS_old
union all
select * from DBMS_new) temp;
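The two steps above can be checked end to end. The following sketch uses an in-memory SQLite database purely as a stand-in for Hive (SQLite has no INSERT OVERWRITE, so the final table is deleted and refilled instead); the table and column names (DBMS, DBMS_old, DBMS_new, primary_key, col2) mirror the example and are otherwise arbitrary:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE DBMS     (primary_key INTEGER, col2 TEXT);  -- final table
CREATE TABLE DBMS_new (primary_key INTEGER, col2 TEXT);  -- incremental batch
CREATE TABLE DBMS_old (primary_key INTEGER, col2 TEXT);  -- staging

INSERT INTO DBMS     VALUES (1, 'a'), (2, 'b'), (3, 'c');  -- initial load
INSERT INTO DBMS_new VALUES (2, 'b-updated');              -- record 2 changed

-- Step 1: keep only old rows whose key is absent from the new batch.
INSERT INTO DBMS_old
SELECT final.* FROM DBMS final
LEFT OUTER JOIN DBMS_new new ON final.primary_key = new.primary_key
WHERE new.primary_key IS NULL;

-- Step 2: rebuild the final table as old + new
-- (DELETE then INSERT stands in for Hive's INSERT OVERWRITE).
DELETE FROM DBMS;
INSERT INTO DBMS
SELECT * FROM DBMS_old UNION ALL SELECT * FROM DBMS_new;
""")

rows = sorted(cur.execute("SELECT primary_key, col2 FROM DBMS"))
print(rows)  # [(1, 'a'), (2, 'b-updated'), (3, 'c')]
```

The result holds exactly one row per primary key, with record 2 carrying its updated value, which is the behaviour the Hive version aims for.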
Hope it helps.
Upvotes: 1