Shekhar
Shekhar

Reputation: 11788

Hive self join based on one column

I have one table in Hive for which data has come from SAP system. This table has columns and data as given below:

+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount | 
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |                       |  123.5 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |                       |  586   |
+----------------------------------------------------------------------+

As shown above, value for vendor_account_number column is present in only 1 row and I want to bring it on all the rest of the rows.

Expected output is as follows:

+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount | 
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  123.5 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  586   |
+----------------------------------------------------------------------+

To achieve this, I wrote following CTE in Hive

with non_blank_account_no as(
  select document_number, vendor_account_number
  from my_table
  where vendor_account_number != ''
)

and then did self left outer join as follows:

select 
    a.document_number, a.year, 
    a.cost_centre, a.amount,
    b.vendor_account_number
from my_table a
left outer join non_blank_account_no b on a.document_number = b.document_number
where a.document_number = ' '

but I am getting duplicated output as shown below

+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount | 
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  123.5 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  586   |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  123.5 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  25.96 |
+----------------------------------------------------------------------+
|       1        | 2016 |     XZ10    |      1234567890       |  586   |
+----------------------------------------------------------------------+

Can anyone please help me understanding what is wrong with my Hive query?

Upvotes: 0

Views: 976

Answers (1)

David דודו Markovitz
David דודו Markovitz

Reputation: 44941

In many use-cases a self-join can be replaced by a windows function

select  document_number
       ,year
       ,cost_center

       ,max (case when vendor_account_number <> '' then vendor_account_number end) over 
        (
            partition by    document_number
        )                                       as vendor_account_number

       ,amount

from    my_table    

Upvotes: 1

Related Questions