guana
guana

Reputation: 3

Clickhouse creating columns base on value of other rows

I'm trying to know how much record a user has generated during the last x hours from when the record is created. In classic SQL what I've tried so far should be working.

I'm posting here because I didn't find anything specific to Clickhouse about this specific problem.

Have a nice day ;)

Simplified example - source table:

ID | time          | name  
---+---------------+---------
1   05/05 14:20       bob         
2   05/05 14:30       josh       
3   05/05 18:30       bob         
4   06/05 15:30       bob         
5   08/05 18:30       josh           

Select result - source table :

ID | time          | name  | nb_last_24hours
---+---------------+-------+----------------
1   05/05 14:20       bob         0
2   05/05 14:30       josh        0
3   05/05 18:30       bob         1
4   06/05 15:30       bob         1
5   08/05 18:30       josh        0

I've tried this query:

SELECT  
    b.ID, b.`time`, b.name,
    (SELECT COUNT(*) 
     FROM T1 AS a 
     WHERE a.name = b.name 
       AND a.`time` < b.`time` 
       AND a.`time` >= (b.`time` - (60 * 60 * 24))
    )
FROM  
    T1 AS b;

but I get the following error:

SQL Error [47] [07000]: Code: 47. DB::Exception: Missing columns: 'b.time' 'b.a_msisdn' while processing query: 'SELECT count() FROM VFG.voice_traffic AS a WHERE (a_msisdn = b.a_msisdn) AND (time < b.time) AND (time >= (b.time - ((60 * 60) * 24)))', required columns: 'a_msisdn' 'b.a_msisdn' 'time' 'b.time', maybe you meant: ['a_msisdn','a_msisdn','time','time']: While processing (SELECT count(*) FROM VFG.voice_traffic AS a WHERE (a.a_msisdn = b.a_msisdn) AND (a.time < b.time) AND (a.time >= (b.time - ((60 * 60) * 24)))) AS _subquery4916. (UNKNOWN_IDENTIFIER) (version 22.3.3.44 (official build)) , server ClickHouseNode [uri=http://192.168.13.79:8123/default, options={use_server_time_zone=false,use_time_zone=false}]@-1433098529

I'm assuming that b doesn't exist in the context of the inner query.

Upvotes: 0

Views: 520

Answers (2)

Mark Barinstein
Mark Barinstein

Reputation: 12339

Try COUNT with RANGE.

SELECT 
  c1, c2
, COUNT (1) OVER 
  (
    PARTITION BY c1
    ORDER BY c2 
    RANGE BETWEEN 
    86400 PRECEDING 
    AND 
    1 PRECEDING
  )
  as cnt
FROM VALUES 
(
  ('bob', now())
, ('bob', now() + INTERVAL 10 HOUR)
, ('bob', now() + INTERVAL 25 HOUR)

, ('josh', now())
, ('josh', now() + INTERVAL 10 HOUR)
, ('josh', now() + INTERVAL 15 HOUR)
)
FORMAT PrettyCompact

The result is:

+-c1---+------------------c2-+-cnt-+
| bob  | 2023-09-15 20:22:34 |   0 |
| bob  | 2023-09-16 06:22:34 |   1 |
| bob  | 2023-09-16 21:22:34 |   1 |
| josh | 2023-09-15 20:22:34 |   0 |
| josh | 2023-09-16 06:22:34 |   1 |
| josh | 2023-09-16 11:22:34 |   2 |
+------+---------------------+-----+

fiddle

Upvotes: 1

Rich Raposa
Rich Raposa

Reputation: 828

You should be able to accomplish that with one query - something like this:

SELECT 
    name,
    time,
    count()
FROM T1
WHERE time >= now() - INTERVAL 24 HOUR
GROUP BY (name, time);

Upvotes: 1

Related Questions