ssk
ssk

Reputation: 63

compare data between two tables with same structure in hive

How to compare two tables with same structures in hive. I believe minus will not work in hive.

SRC table:

id name
1   A
2   B
3   C

TGT table:

id name

1   A
2   C
3   C

can anyone help me with a query.

Upvotes: 2

Views: 10019

Answers (5)

leftjoin
leftjoin

Reputation: 38325

Minus does not exists in Hive. You can use left join or full join for find different records from both tables.

LEFT JOIN: - allows to find records that not exist in TGT

select a.id, a.name 
  from TGT a left join TGT b on a.id=b.id and a.name=b.name
where b.id is null

The same may be achieved using NOT EXISTS

select a.id, a.name 
      from TGT a 
where not exists(select 1 from  TGT b where a.id=b.id and a.name=b.name)

FULL JOIN: - allows to find differences from both the tables

select a.id as SRC_ID, a.name as SRC_NAME ,
       b.id as TGT_ID, b.name as TGT_NAME
  from TGT a full join TGT b on a.id=b.id and a.name=b.name
where a.id is null OR b.id is null

Upvotes: 1

firemonkey
firemonkey

Reputation: 329

There is a free tool (dataq.io) which can be used to compare data between two tables in hive. It can visually display the differences.

Disclaimer : I work for dataq.io

Upvotes: 1

Sourygna
Sourygna

Reputation: 719

To do a full comparison of 2 tables, my experience has shown me that using some checksum mechanism is a more effective and reliable solution than Joining the tables (which gives some difficulties when keys are repeated for instance).

You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq

Upvotes: 1

avrsanjay
avrsanjay

Reputation: 805

If you are looking for equality between two tables and for differences if any, you can do like following

SELECT MIN(TableName) as TableName, ID, NAME FROM ( SELECT 'SRC_TABLE' as TableName, A.ID, A.NAME FROM A UNION ALL SELECT 'TGT_TABLE' as TableName, B.ID, B.NAME FROM B ) tmp GROUP BY ID, NAME HAVING COUNT(*) = 1 ORDER BY ID

Basically what the above query does is as follows,

  • Combines all the records from both the tables & groups them by all columns. Later prints out only those records which are unique [having count(*) = 1].

  • If the above query prints zero records then it means both the tables are perfectly equal

  • If the above query prints any records at all, then it means those particular records are not present in one or the other tables
  • For the above example you listed the output is as follows,

    [ TBL_NAME | ID | NAME ]

    ----------------------------

    [ SRC_TABLE | 2 | B ]

    [ TGT_TABLE | 2 | C ]

Credits: I found the above solution from this website. I tried it, and it was fantastic!

Upvotes: 4

David דודו Markovitz
David דודו Markovitz

Reputation: 44941

select      id
           ,SRC_name
           ,TGT_name

from       (select      id
                       ,min   (case tab when 'S' then name end) as SRC_name
                       ,min   (case tab when 'T' then name end) as TGT_name
                       ,count (case tab when 'S' then 1    end) as SRC_count 
                       ,count (case tab when 'T' then 1    end) as TGT_count 

            from        (           select 'S' as tab,id,name from SRC
                        union all   select 'T' as tab,id,name from TGT
                        ) t

            group by    id  
            ) t

having      SRC_count = 0
        or  TGT_count = 0
        or  SRC_name != TGT_name
;

If name might be NULL or id might be non-unique, additional conditions should be added

Upvotes: 1

Related Questions