Reputation: 63
How to compare two tables with same structures in hive. I believe minus will not work in hive.
SRC table:
id name
1 A
2 B
3 C
TGT table:
id name
1 A
2 C
3 C
can anyone help me with a query.
Upvotes: 2
Views: 10019
Reputation: 38325
Minus does not exists in Hive. You can use left join or full join for find different records from both tables.
LEFT JOIN: - allows to find records that not exist in TGT
select a.id, a.name
from TGT a left join TGT b on a.id=b.id and a.name=b.name
where b.id is null
The same may be achieved using NOT EXISTS
select a.id, a.name
from TGT a
where not exists(select 1 from TGT b where a.id=b.id and a.name=b.name)
FULL JOIN: - allows to find differences from both the tables
select a.id as SRC_ID, a.name as SRC_NAME ,
b.id as TGT_ID, b.name as TGT_NAME
from TGT a full join TGT b on a.id=b.id and a.name=b.name
where a.id is null OR b.id is null
Upvotes: 1
Reputation: 329
There is a free tool (dataq.io) which can be used to compare data between two tables in hive. It can visually display the differences.
Disclaimer : I work for dataq.io
Upvotes: 1
Reputation: 719
To do a full comparison of 2 tables, my experience has shown me that using some checksum mechanism is a more effective and reliable solution than Joining the tables (which gives some difficulties when keys are repeated for instance).
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq
Upvotes: 1
Reputation: 805
If you are looking for equality between two tables and for differences if any, you can do like following
SELECT MIN(TableName) as TableName, ID, NAME
FROM
(
SELECT 'SRC_TABLE' as TableName, A.ID, A.NAME
FROM A
UNION ALL
SELECT 'TGT_TABLE' as TableName, B.ID, B.NAME
FROM B
) tmp
GROUP BY ID, NAME
HAVING COUNT(*) = 1
ORDER BY ID
Basically what the above query does is as follows,
Combines all the records from both the tables & groups them by all columns. Later prints out only those records which are unique [having count(*) = 1].
If the above query prints zero records then it means both the tables are perfectly equal
For the above example you listed the output is as follows,
[ TBL_NAME | ID | NAME ]
----------------------------
[ SRC_TABLE | 2 | B ]
[ TGT_TABLE | 2 | C ]
Credits: I found the above solution from this website. I tried it, and it was fantastic!
Upvotes: 4
Reputation: 44941
select id
,SRC_name
,TGT_name
from (select id
,min (case tab when 'S' then name end) as SRC_name
,min (case tab when 'T' then name end) as TGT_name
,count (case tab when 'S' then 1 end) as SRC_count
,count (case tab when 'T' then 1 end) as TGT_count
from ( select 'S' as tab,id,name from SRC
union all select 'T' as tab,id,name from TGT
) t
group by id
) t
having SRC_count = 0
or TGT_count = 0
or SRC_name != TGT_name
;
If name
might be NULL or id
might be non-unique, additional conditions should be added
Upvotes: 1