Reputation: 16697
What is the proper way to query top N rows by group in ClickHouse?
Lets take an example of tbl having id2, id4, v3 columns and N=2.
I tried the following
SELECT
id2,
id4,
v3 AS v3
FROM tbl
GROUP BY
id2,
id4
ORDER BY v3 DESC
LIMIT 2 BY
id2,
id4
but getting error
Received exception from server (version 19.3.4):
Code: 215. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception
: Column v3 is not under aggregate function and not in GROUP BY..
I could put v3
into GROUP BY and it does seems to work, but it is not efficient to group by a metric.
There is any
aggregate function, but we actually want all
values (limited to 2 by LIMIT BY clause) not any
value, so it doesn't sound like to be proper solution here.
SELECT
id2,
id4,
any(v3) AS v3
FROM tbl
GROUP BY
id2,
id4
ORDER BY v3 DESC
LIMIT 2 BY
id2,
id4
Upvotes: 5
Views: 16779
Reputation: 311
You can also do it the way you would do it in "normal" SQL as described in this thread
While vladimir's solutions works for many cases, it didn't work for my case. I have a table, that looks like this:
column | group by
++++++++++++++++++++++
A | Yes
B | Yes
C | No
Now, imagine column A identifies the user and column B stands for whatever action a user could do e. g. on your website or your online game. Column C is the sum of how often the user has done this particular action. Vladimir's solution would allow me to get column A and C, but not the action the user has done (column B), meaning I would know how often a user has done something, but not what.
The reason for this is that it doesn't make sense to group by both A and B. Every row would be a unique group and you aren't able to find the top K rows since every group has only 1 member. The result is the same table you query against. Instead, if you group only by A, you can apply vladimir's solution but would get only columns A and C. You can't output column B because it's not part of the Group By statement as explained.
If you would like to get the top 2 (or top 5, or top 100) actions a user has done, you might look for a solution that this:
SELECT rs.id2, rs.id4, rs.v3
FROM (
SELECT id2, id4, v3, row_number()
OVER (PARTITION BY id2, id4 ORDER BY v3 DESC) AS Rank
FROM tbl
) rs WHERE Rank <= 2
Note: To use this, you have to set allow_experimental_window_functions = 1
.
Upvotes: 1
Reputation: 15226
It can be used aggregate functions like this:
SELECT
id2,
id4,
arrayJoin(arraySlice(arrayReverseSort(groupArray(v3)), 1, 2)) v3
FROM tbl
GROUP BY
id2,
id4
Upvotes: 7