Reputation: 534
I have a table called "Sold_Items", shown below, and I want to use Spark SQL to get the net sell volume for each participant per item.
Item  Buyer  Seller  Qty
------------------------
A     JD     Lidl    100
B     SD     JD      500
A     Coop   JD      125
C     JD     SD      300
Intermediate table:
Item  Participant  Buy  Sell
----------------------------
A     JD           100  125
B     JD           0    500
C     JD           300  0
A     Coop         125  0
A     Lidl         0    100
B     SD           500  0
C     SD           0    300
The final result should look something like this:
Item  Participant  Net Sell
---------------------------
A     JD           25
B     JD           500
C     JD           -300
A     Coop         -125
A     Lidl         100
B     SD           -500
C     SD           300
I have the below two queries for the buy and sell sides of the first table.
Buy:
SELECT Item, Buyer, sum(qty) as buy_qty from sold_items group by Item, Buyer
Sell:
SELECT Item, Seller, sum(qty) as sell_qty from sold_items group by Item, Seller
I am trying to build the intermediate table so I can use it to get the final result, but I cannot seem to join the two queries. I would appreciate any suggestions on combining the above two queries to get the intermediate table.
Upvotes: 1
Views: 718
Reputation: 1271141
Unpivot and reaggregate. This is simplest with union all:
select item, participant, sum(buy_qty) as buy, sum(sell_qty) as sell
from ((select item, buyer as participant, sum(qty) as buy_qty, 0 as sell_qty
       from sold_items
       group by item, buyer
      ) union all
      (select item, seller, 0, sum(qty)
       from sold_items
       group by item, seller
      )
     ) bs
group by item, participant;
Note that the aggregation in the subqueries is not really needed, so this will also work:
select item, participant, sum(buy_qty) as buy, sum(sell_qty) as sell
from ((select item, buyer as participant, qty as buy_qty, 0 as sell_qty
       from sold_items
      ) union all
      (select item, seller, 0, qty
       from sold_items
      )
     ) bs
group by item, participant;
I would expect the first version, which pre-aggregates in the subqueries, to perform better on large data sets, although the improvement might not be that big.
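If you also want the final result in the same statement, net sell is just sell minus buy, as in your expected output. A minimal sketch building on the second query above (the CTE name intermediate and the alias net_sell are only illustrative):
with intermediate as (
      -- unpivot buys and sells into one row per (item, participant)
      select item, participant, sum(buy_qty) as buy, sum(sell_qty) as sell
      from ((select item, buyer as participant, qty as buy_qty, 0 as sell_qty
             from sold_items
            ) union all
            (select item, seller, 0, qty
             from sold_items
            )
           ) bs
      group by item, participant
     )
-- net sell is sell volume minus buy volume
select item, participant, sell - buy as net_sell
from intermediate;
No join is needed; the net figure can be computed directly from the aggregated buy and sell columns.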
Upvotes: 1