Spark SQL to join two results from same table

Question

I have a table called "Sold_Items" like below. And I want to use Spark SQL to get the net sell volumes for each participant.

Item     Buyer   Seller   Qty
----------------------------------
A        JD      Lidl     100
B        SD      JD       500
A        Coop    JD       125
C        JD      SD       300

Intermediate table

Item     Participant      Buy         Sell
--------------------------------------------
A        JD                100        125 
B        JD                0          500
C        JD                300          0
A        Coop              125          0
A        Lidl                0        100
B        SD                500          0    
C        SD                  0        300

Final result should look something like below.

Item     Participant      Net Sell
----------------------------------
A        JD                 25
B        JD                500
C        JD               -300
A        Coop             -125
A        Lidl              100
B        SD               -500  
C        SD                300

I have below two queries for buy and sell side of the first table.

Buy:

SELECT Item, Buyer, sum(qty) as buy_qty from sold_items group by Item, Buyer

Sells:

SELECT Item, Seller, sum(qty) as sell_qty from sold_items group by Item, Seller

I am trying to get the intermediate table so I can use that table to get the final result. But I cannot seem to join the two queries. Would appreciate any suggestions on combining the above two queries to get the intermediate table.

Gordon Linoff · Accepted Answer

Unpivot and reaggregate. This is simplest with union all:

select user, sum(buy_qty), sum(sell_qty)
from ((select buyer as user, sum(qty) as buy_qty, 0 as sell_qty
       from sold_items
       group by buyer
      ) union all
      (select seller as user, 0, sum(qty)
       from sold_items
       group by seller
      )
     ) bs
group by user;

Note that the aggregation in the subqueries is not really needed, so this will also work:

select user, sum(buy_qty), sum(sell_qty)
from ((select buyer as user, qty as buy_qty, 0 as sell_qty
       from sold_items
      ) union all
      (select seller as user, 0, qty
       from sold_items
      )
     ) bs
group by user;

I would expect the multiple aggregation version to have better performance on large data sets -- although the improvement might not be that big.

Spark SQL to join two results from same table

Answers (1)

Related Questions