sql_learner
sql_learner

Reputation: 547

Getting Top 40% users basis sales

I have a table which has columns date, user_id, sales_amount. The table sample is as below

+------------+---------+--------------+
|    date    | user_id | sales_amount |
+------------+---------+--------------+
| 2020-01-01 | 1       | 27           |
| 2020-01-01 | 2       | 32           |
| 2020-01-01 | 3       | 17           |
| 2020-01-03 | 1       | 19           |
| 2020-01-03 | 2       | 18           |
| 2020-01-03 | 3       | 40           |
| …………..     | …………..  | …………..       |
| …………..     | …………..  | …………..       |
| …………..     | …………..  | …………..       |
+------------+---------+--------------+

I want to get top 40% users basis sales. I would have used something like SELECT TOP 40 PERCENT users after aggregation. But I am not using MS-SQL server, so that method is not applicable.

Something that I know is as below

First get number of rows from below query

SELECT MAX(Rn) AS number_of_rows

FROM(

SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn

FROM 
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B

Second calculate the 40 % of the above value and get the users

SELECT *
FROM

(SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn

FROM 
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B

WHERE Rn <= 0.4* (number_of_rows)

Above two steps can be combined as below

SELECT *
FROM

(SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn

FROM 
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B

WHERE Rn <= 0.4 * (SELECT MAX(Rn) AS number_of_rows

FROM(

SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn

FROM 
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B)

Is there any optimum way/builtin function to obtain this in hive ?

Upvotes: 0

Views: 72

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269813

Yes! You can do both in one step:

SELECT u.*
FROM (SELECT user_id, SUM(AMOUNT) as amt,
             ROW_NUMBER() OVER (ORDER BY SUM(AMOUNT) DESC) as seqnum,
             COUNT(*) OVER () as cnt

      FROM t
      GROUP BY user_id
     ) u
WHERE seqnum <= cnt * 0.4;

Upvotes: 1

Related Questions