Reputation: 4563
I have two large Hive tables (billions of records) that can be joined by a common key (i.e. the transaction ID). table_a
contains all the transaction ID's, and table_b
has transactions for some of the time.
In theory, there should be no duplicate transaction ID's in either table. In practice, there are a few duplicates and, while they're relatively few in number, the many:many join has potential to cause problems. I'm building a model based on this data and, as a first pass, it's probably easier to ignore all the records with duplicate transaction ID's.
I wrote an ugly Hive query that, although it logically does what it needs to do, it's horribly inefficient:
SELECT
table_a.someCol,
table_b.anotherCol,
[etc...]
FROM
(SELECT
table_a.*
FROM table_a
INNER JOIN
(SELECT
transaction_id
FROM table_a
GROUP BY transaction_id
HAVING COUNT(*) = 1) unique_transaction_ids
ON table_a.transaction_id = unique_transaction_ids.transaction_id) table_a_unique_transaction_ids_only
LEFT OUTER JOIN
(SELECT
table_b.*
FROM table_b
INNER JOIN
(SELECT
transaction_id
FROM table_b
GROUP BY transaction_id
HAVING COUNT(*) = 1) unique_transaction_ids
ON table_b.transaction_id = unique_transaction_ids.transaction_id) table_b_unique_transaction_ids_only
ON table_a_unique_transaction_ids_only.transaction_id = table_b_unique_transaction_ids_only.transaction_id;
How it works:
First, for for both tables, create a list of transaction ID's that that only appear once:
SELECT
transaction_id
FROM table_?
GROUP BY transaction_id
HAVING COUNT(*) = 1
Then filter the tables by inner joining the unique transaction subqueries to the original tables.
Finally, left outer join the filtered subqueries.
Is there a more efficient way to write this (e.g. using analytical functions such as ROW_NUMBER() OVER ...
)?
Upvotes: 1
Views: 869
Reputation: 17613
If you want table_a
and table_b
together, the LEFT OUTER JOIN
seems inevitable. The two self-joins can be avoided. Since you want transaction_id
's that have occurred only once, you can take MAX()
or MIN()
on the rest of the columns without loss of information (even if they are not numerical columns). Something like
select transaction_id
, max(col1) col1
, max(col2) col2
.
.
.
, max(coln) coln
from table_a
group by transaction_id
having count(transaction_id) = 1
It's a way to "pull columns through" to the next "level" without having to group by
them. It can be tedious to write if you have a lot of columns but is usually worth it to avoid two self-joins.
Upvotes: 2