user1927808
user1927808

Reputation: 577

Hive Group by after self join

Folks,

We have one requirement where we want to apply group by clause after joining HIVE table with self.

e.g. Data

CUSTOMER_NAME,PRODUCT_NAME,PURCHASE_PRICE

customer1,product1,20
customer1,product2,30
customer1,product1,25

Now we want to fetch customers(top 5 customers after doing just sum of price ,product name not present in sub query ) by considering sum of all products and later group resultset by CUSTOMER_NAME,PRODUCT_NAME

select customer_name,product_name,sum(purchase_price)
from customer_prd cprd
Join (select customer_name,sum(purchase_prices) order by sum group by customer_name limit 5) cprdd
where cprd.customer_name = cprdd.customer_name group by customer_name,product_name

Getting error saying cannot group like this in HIVE ?

Upvotes: 2

Views: 1796

Answers (2)

Jerome Banks
Jerome Banks

Reputation: 1630

I think Joe K is correct, but I would rethink what you are doing, and avoid the join completely, and use the 'collect' or 'collect_max' UDF available in the Brickhouse library ( http://github.com/klout/brickhouse ). First sum by product, then collect and sum at the same time.

SELECT customer_name, sum(purchases) as total_purchases, collect( product_name, purchases) as product_map
FROM
  ( SELECT customer_name, product_name, sum(purchase_prices) AS purchases
    FROM customer_prd
    GROUP BY customer_name, product_name
  ) sp
GROUP BY customer_name
ORDER BY sum(purchases)
LIMIT 5;

This will still cause a sort in order to get the top 5 customers. If you have a large long tail of small customers, but a few large customer whales, you could add a 'HAVING sum(purchases) > ' to whittle down the size of records to be sorted.

Upvotes: 2

Joe K
Joe K

Reputation: 18434

After the join, your column names become ambiguous. Hive doesn't know if you care about the one from the left or right side of the join. In this case, it doesn't matter, since you're doing an inner join on them being equal, but hive isn't smart enough to figure that out. Try this:

select cprd.customer_name, cprd.product_name, sum(purchase_price)
from customer_prd cprd
Join (select customer_name, sum(purchase_price) as sum from customer_prd group by customer_name order by sum desc limit 5) cprdd
where cprd.customer_name = cprdd.customer_name group by cprd.customer_name, cprd.product_name;

Upvotes: 2

Related Questions