Hive inefficient nested joins

Question

I have two large Hive tables (billions of records) that can be joined by a common key (i.e. the transaction ID). table_a contains all the transaction ID's, and table_b has transactions for some of the time.

enter image description here

In theory, there should be no duplicate transaction ID's in either table. In practice, there are a few duplicates and, while they're relatively few in number, the many:many join has potential to cause problems. I'm building a model based on this data and, as a first pass, it's probably easier to ignore all the records with duplicate transaction ID's.

I wrote an ugly Hive query that, although it logically does what it needs to do, it's horribly inefficient:

SELECT
   table_a.someCol,
   table_b.anotherCol,
   [etc...]
FROM
    (SELECT
       table_a.*
     FROM table_a
     INNER JOIN
       (SELECT
          transaction_id
        FROM table_a
        GROUP BY transaction_id
        HAVING COUNT(*) = 1) unique_transaction_ids
    ON table_a.transaction_id = unique_transaction_ids.transaction_id) table_a_unique_transaction_ids_only
LEFT OUTER JOIN
    (SELECT
       table_b.*
     FROM table_b
     INNER JOIN
      (SELECT
         transaction_id
       FROM table_b
       GROUP BY transaction_id
       HAVING COUNT(*) = 1) unique_transaction_ids
    ON table_b.transaction_id = unique_transaction_ids.transaction_id) table_b_unique_transaction_ids_only
ON table_a_unique_transaction_ids_only.transaction_id = table_b_unique_transaction_ids_only.transaction_id;

How it works:

First, for for both tables, create a list of transaction ID's that that only appear once:

SELECT
    transaction_id
FROM table_?
GROUP BY transaction_id
HAVING COUNT(*) = 1

Then filter the tables by inner joining the unique transaction subqueries to the original tables.

Finally, left outer join the filtered subqueries.

Is there a more efficient way to write this (e.g. using analytical functions such as ROW_NUMBER() OVER ...)?

o-90 · Accepted Answer

If you want table_a and table_b together, the LEFT OUTER JOIN seems inevitable. The two self-joins can be avoided. Since you want transaction_id's that have occurred only once, you can take MAX() or MIN() on the rest of the columns without loss of information (even if they are not numerical columns). Something like

select transaction_id
  , max(col1) col1
  , max(col2) col2
        .
        .
        .
  , max(coln) coln
from table_a
group by transaction_id
having count(transaction_id) = 1

It's a way to "pull columns through" to the next "level" without having to group by them. It can be tedious to write if you have a lot of columns but is usually worth it to avoid two self-joins.

Hive inefficient nested joins

Answers (1)

Related Questions