Reputation: 547
I have a table which has columns date
, user_id
, sales_amount
. The table sample is as below
+------------+---------+--------------+
| date | user_id | sales_amount |
+------------+---------+--------------+
| 2020-01-01 | 1 | 27 |
| 2020-01-01 | 2 | 32 |
| 2020-01-01 | 3 | 17 |
| 2020-01-03 | 1 | 19 |
| 2020-01-03 | 2 | 18 |
| 2020-01-03 | 3 | 40 |
| ………….. | ………….. | ………….. |
| ………….. | ………….. | ………….. |
| ………….. | ………….. | ………….. |
+------------+---------+--------------+
I want to get top 40% users basis sales. I would have used something like SELECT TOP 40 PERCENT users
after aggregation. But I am not using MS-SQL server, so that method is not applicable.
Something that I know is as below
First get number of rows from below query
SELECT MAX(Rn) AS number_of_rows
FROM(
SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn
FROM
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B
Second calculate the 40 % of the above value and get the users
SELECT *
FROM
(SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn
FROM
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B
WHERE Rn <= 0.4* (number_of_rows)
Above two steps can be combined as below
SELECT *
FROM
(SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn
FROM
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B
WHERE Rn <= 0.4 * (SELECT MAX(Rn) AS number_of_rows
FROM(
SELECT *,row_number() OVER(ORDER BY Amt DESC) as Rn
FROM
(SELECT user_id, SUM(AMOUNT) AS Amt
FROM table
GROUP BY user_id) A ) B)
Is there any optimum way/builtin function to obtain this in hive ?
Upvotes: 0
Views: 72
Reputation: 1269813
Yes! You can do both in one step:
SELECT u.*
FROM (SELECT user_id, SUM(AMOUNT) as amt,
ROW_NUMBER() OVER (ORDER BY SUM(AMOUNT) DESC) as seqnum,
COUNT(*) OVER () as cnt
FROM t
GROUP BY user_id
) u
WHERE seqnum <= cnt * 0.4;
Upvotes: 1