Reputation: 77
I've two tables in hive with exact schema. Both the tables have exact no of row count. I need to compare the individual column records between both the tables. If a particular record value is mismatched, the entire row should be thrown as output. The tables have approximately 358 columns and millions of records.
Upvotes: 3
Views: 2687
Reputation: 331
This is what you can do:
Join both the tables using the unique key( i believe u must be having unique identifier in ur table) use the hash value of all the columns combined using hash function in hive to figure out the difference.query will look like this:
select * from tab1 a join tab2 b
using a.id=b.id
where hash(a.col1,a.col2....)<>hash(b.col1,b.col2...);
Upvotes: 5
Reputation: 11274
As an example, I have identical table structure (tbl1
, tbl2
) with differing values (department_id = 4
),
select * from tbl1;
+---------------------+-----------------------+--+
| tbl1.department_id | tbl1.department_name |
+---------------------+-----------------------+--+
| 2 | Fitness |
| 3 | Footwear |
| 4 | Apparel |
| 5 | Golf |
| 6 | Outdoors |
| 7 | Fan Shop |
+---------------------+-----------------------+--+
select * from tbl2 where department_id = 4;
+---------------------+-----------------------+--+
| tbl2.department_id | tbl2.department_name |
+---------------------+-----------------------+--+
| 4 | Hive |
+---------------------+-----------------------+--+
I can do,
select department_id, department_name, count(*)
from (
select * from tbl1
union all
select * from tbl2 ) both
group by department_id, department_name
having count(*) = 1 //if count(*) is 2 -> rows of tbl1,tbl2 are identical.
to get
+----------------+------------------+------+--+
| department_id | department_name | _c2 |
+----------------+------------------+------+--+
| 4 | Apparel | 1 |
| 4 | Hive | 1 |
+----------------+------------------+------+--+
You may need to test with missing rows from one another etc.
Upvotes: 1