jangorecki
jangorecki

Reputation: 16697

Top N rows by group in ClickHouse

What is the proper way to query top N rows by group in ClickHouse?
Lets take an example of tbl having id2, id4, v3 columns and N=2. I tried the following

SELECT                                                                          
    id2,                                                                        
    id4,                 
    v3 AS v3        
FROM tbl
GROUP BY                 
    id2,                 
    id4                  
ORDER BY v3 DESC                                                                
LIMIT 2 BY                       
    id2,                 
    id4      

but getting error

Received exception from server (version 19.3.4):
Code: 215. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception
: Column v3 is not under aggregate function and not in GROUP BY..

I could put v3 into GROUP BY and it does seems to work, but it is not efficient to group by a metric.

There is any aggregate function, but we actually want all values (limited to 2 by LIMIT BY clause) not any value, so it doesn't sound like to be proper solution here.

SELECT                                                                          
    id2,                                                                        
    id4,                 
    any(v3) AS v3        
FROM tbl
GROUP BY                 
    id2,                 
    id4                  
ORDER BY v3 DESC                                                                
LIMIT 2 BY                       
    id2,                 
    id4      

Upvotes: 5

Views: 16779

Answers (2)

Richard Steinbrecht
Richard Steinbrecht

Reputation: 311

You can also do it the way you would do it in "normal" SQL as described in this thread

While vladimir's solutions works for many cases, it didn't work for my case. I have a table, that looks like this:

column    | group by    
++++++++++++++++++++++
A         | Yes
B         | Yes
C         | No

Now, imagine column A identifies the user and column B stands for whatever action a user could do e. g. on your website or your online game. Column C is the sum of how often the user has done this particular action. Vladimir's solution would allow me to get column A and C, but not the action the user has done (column B), meaning I would know how often a user has done something, but not what.

The reason for this is that it doesn't make sense to group by both A and B. Every row would be a unique group and you aren't able to find the top K rows since every group has only 1 member. The result is the same table you query against. Instead, if you group only by A, you can apply vladimir's solution but would get only columns A and C. You can't output column B because it's not part of the Group By statement as explained.

If you would like to get the top 2 (or top 5, or top 100) actions a user has done, you might look for a solution that this:

SELECT rs.id2, rs.id4, rs.v3
    FROM (
        SELECT id2, id4, v3, row_number()
        OVER (PARTITION BY id2, id4 ORDER BY v3 DESC) AS Rank
        FROM tbl
    ) rs WHERE Rank <= 2

Note: To use this, you have to set allow_experimental_window_functions = 1.

Upvotes: 1

vladimir
vladimir

Reputation: 15226

It can be used aggregate functions like this:

SELECT
    id2,
    id4,
    arrayJoin(arraySlice(arrayReverseSort(groupArray(v3)), 1, 2)) v3
FROM tbl
GROUP BY
    id2,
    id4

Upvotes: 7

Related Questions