kingJulian

Reputation: 6170

In PostgreSQL, how can I find 3 consecutive events where only the first one occurs within a time frame?

I have the table below with user_id, timestamp and event_id. The column "tag" denotes if this is the desirable row (tag = 1) or not (tag = 0):

user_id | timestamp                 | event_id | tag 

    46  | 2018-12-21 08:42:35.000   | 1        | 0
    46  | 2018-12-21 09:58:35.000   | 2        | 1
    46  | 2018-12-22 06:42:35.000   | 3        | 0
    46  | 2018-12-22 07:18:35.000   | 4        | 1
    46  | 2018-12-22 08:30:35.000   | 5        | 1
    46  | 2018-12-23 06:42:35.000   | 6        | 0
    46  | 2018-12-23 06:11:35.000   | 7        | 1
    46  | 2018-12-23 07:51:35.000   | 8        | 1
    46  | 2018-12-23 07:26:35.000   | 9        | 1
    46  | 2018-12-23 07:37:35.000   | 10       | 1
    46  | 2018-12-23 08:05:35.000   | 11       | 1
    46  | 2018-12-23 08:20:35.000   | 12       | 1 
    46  | 2018-12-23 09:10:35.000   | 13       | 1
    46  | 2018-12-23 09:42:35.000   | 14       | 0
    46  | 2018-12-23 10:17:35.000   | 15       | 1   
    46  | 2018-12-24 09:42:35.000   | 16       | 0
    46  | 2018-12-24 10:45:35.000   | 17       | 0
    46  | 2018-12-24 11:12:35.000   | 18       | 0
    46  | 2018-12-24 11:51:35.000   | 19       | 1
    122 | 2018-12-22 08:30:35.000   | 1        | 1
    122 | 2018-12-23 06:42:35.000   | 2        | 0
    122 | 2018-12-23 06:11:35.000   | 3        | 1
    122 | 2018-12-23 07:51:35.000   | 4        | 1
    122 | 2018-12-23 07:26:35.000   | 5        | 1
    122 | 2018-12-23 07:37:35.000   | 6        | 1
    122 | 2018-12-28 06:42:35.000   | 1        | 0
    122 | 2018-12-28 06:38:35.000   | 2        | 1
    122 | 2018-12-28 07:51:35.000   | 3        | 1
    122 | 2018-12-28 07:26:35.000   | 4        | 1
    122 | 2018-12-28 08:42:35.000   | 5        | 0
    122 | 2018-12-28 09:38:35.000   | 6        | 0
    122 | 2018-12-28 10:51:35.000   | 7        | 0
    122 | 2018-12-28 11:26:35.000   | 8        | 0

So I would like to find:

  1. The number of times the user had 3 CORRECT (tag = 1) consecutive events (i.e. triplets) on the same date.
  2. The timestamp of the 1st event for each of those 3 consecutive events.

Ideally the returned table should look like this:

user_id | first_occurrence           |event_id | consecutive_events 
     46 | 2018-12-23 06:11:35.000    | 7       | 2  <-- 2 consecutive triplets 
     46 | 2018-12-23 07:37:35.000    | 10      | 2  <-- this has 4 consecutive events  but I am only interested in triplets of events.
     122| 2018-12-23 06:11:35.000    | 4       | 1
     122| 2018-12-28 06:38:35.000    | 2       | 1  

In other words, the column consecutive_events must display the number of triplets for a user on each day, while the columns first_occurrence & event_id should display the first timestamp & event_id of each triplet per user_id & date.

NOTE: The user_id 46 has a triplet of 0 (tag = 0). This triplet should be excluded.

    46  | 2018-12-24 09:42:35.000   | 16       | 0
    46  | 2018-12-24 10:45:35.000   | 17       | 0
    46  | 2018-12-24 11:12:35.000   | 18       | 0

I tried to use the DENSE_RANK() function but the results are far from optimal:

dense_rank() over (partition by user_id, date(timestamp) order by tag,date(timestamp) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

[UPDATE]

The example I'm referring to in the first comment of Gordon's answer is the following. For these consecutive events:

user_id | timestamp                 | event_id | tag 
    46  | 2018-12-23 06:11:35.000   | 7        | 1
    46  | 2018-12-23 07:51:35.000   | 8        | 1
    46  | 2018-12-23 07:26:35.000   | 9        | 1
    46  | 2018-12-23 07:37:35.000   | 10       | 1
    46  | 2018-12-23 08:05:35.000   | 11       | 1
    46  | 2018-12-23 08:20:35.000   | 12       | 1 
    46  | 2018-12-23 09:10:35.000   | 13       | 1

the query returns:

 user_id | min(timestamp)            | min_event_id | num_consecutive 
     46  | 2018-12-23 06:11:35.000   | 7            | 2

It should also return

user_id | min(timestamp)            | min_event_id | num_consecutive 
     46  | 2018-12-23 06:11:35.000   | 7            | 2
     46  | 2018-12-23 07:37:35.000   | 10           | 2

Do you think this is possible to fetch as well?

Upvotes: 0

Views: 559

Answers (2)

Santiago Cañas

Reputation: 1

The number of times the user had 3 (tag = 1 or tag = 0) consecutive events (i.e. triplets) in the same date.

The timestamp of the 1st event for each of those 3 consecutive events.

A query along these lines should return that (it keeps only tag = 1 triplets):

with raw_data as (

  select
    user_id,
    timestamp,
    timestamp::date as date,
    event_id,
    tag, 
    lag(tag, -1) over(partition by user_id, timestamp::date order by timestamp) as next_tag,
    lag(tag, -2) over(partition by user_id, timestamp::date order by timestamp) as next_tag_1
  
  from raw_data.test

)

select
  user_id,
  date,
  min(timestamp) as first_event,
  count(distinct event_id) as number_of_times

from raw_data
where (tag = 1
  and next_tag = 1
  and next_tag_1 = 1)
group by 1,2
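
To see what this query counts, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. The six rows are made up for illustration, `date(timestamp)` stands in for `timestamp::date`, and `lead(tag, 1)` / `lead(tag, 2)` are used in place of `lag(tag, -1)` / `lag(tag, -2)` because SQLite does not accept a negative offset.

```python
import sqlite3

# Hypothetical data: one user, one day, a run of four tag = 1 events,
# then a tag = 0 event and a lone tag = 1 event.
rows = [
    (46, "2018-12-23 06:00:00", 1, 1),
    (46, "2018-12-23 06:10:00", 2, 1),
    (46, "2018-12-23 06:20:00", 3, 1),
    (46, "2018-12-23 06:30:00", 4, 1),
    (46, "2018-12-23 06:40:00", 5, 0),
    (46, "2018-12-23 06:50:00", 6, 1),
]
conn = sqlite3.connect(":memory:")
conn.execute("create table test (user_id int, timestamp text, event_id int, tag int)")
conn.executemany("insert into test values (?, ?, ?, ?)", rows)

# Window functions need SQLite >= 3.25.
query = """
with flagged as (
  select user_id, timestamp, date(timestamp) as day, event_id, tag,
         lead(tag, 1) over (partition by user_id, date(timestamp) order by timestamp) as next_tag,
         lead(tag, 2) over (partition by user_id, date(timestamp) order by timestamp) as next_tag_1
  from test
)
select user_id, day, min(timestamp) as first_event,
       count(distinct event_id) as number_of_times
from flagged
where tag = 1 and next_tag = 1 and next_tag_1 = 1
group by user_id, day
"""
result = conn.execute(query).fetchall()
print(result)
```

On this data the run of four events is reported with number_of_times = 2, because events 1 and 2 each start a triplet. In other words, the query counts overlapping triplets, which may or may not be what the question is after.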

Upvotes: 0

Gordon Linoff

Reputation: 1269445

This is a gaps-and-islands problem. The difference of row numbers seems like the best approach.

To get all the adjacent values:

select user_id, min(timestamp) as timestamp,
       count(*) as num_consecutive,
       min(event_id) as min_event_id
from (select t.*,
             row_number() over (partition by user_id, timestamp::date order by timestamp) as seqnum,
             row_number() over (partition by user_id, timestamp::date, tag order by timestamp) as seqnum_t
      from t
     ) t
group by user_id, timestamp::date, tag, (seqnum - seqnum_t);
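
As an illustration of why the difference of row numbers identifies each run, here is a small sketch that prints seqnum, seqnum_t and their difference row by row. It uses Python's sqlite3 module with seven made-up rows (not the data from the question), and `date(timestamp)` stands in for `timestamp::date` so the statement runs on SQLite.

```python
import sqlite3

# Hypothetical miniature of the question's table: one user, one day,
# tags in timestamp order are 0, 1, 1, 1, 0, 1, 1.
rows = [
    (46, "2018-12-23 06:00:00", 1, 0),
    (46, "2018-12-23 06:10:00", 2, 1),
    (46, "2018-12-23 06:20:00", 3, 1),
    (46, "2018-12-23 06:30:00", 4, 1),
    (46, "2018-12-23 06:40:00", 5, 0),
    (46, "2018-12-23 06:50:00", 6, 1),
    (46, "2018-12-23 07:00:00", 7, 1),
]
conn = sqlite3.connect(":memory:")
conn.execute("create table t (user_id int, timestamp text, event_id int, tag int)")
conn.executemany("insert into t values (?, ?, ?, ?)", rows)

# Window functions need SQLite >= 3.25.
query = """
select event_id, tag,
       row_number() over (partition by user_id, date(timestamp) order by timestamp) as seqnum,
       row_number() over (partition by user_id, date(timestamp), tag order by timestamp) as seqnum_t
from t
order by timestamp
"""
# Append the difference that the answer groups by.
trace = [(e, tag, s, st, s - st) for e, tag, s, st in conn.execute(query)]
for row in trace:
    print(row)
```

The last column comes out as 0, 1, 1, 1, 3, 2, 2. Within one tag value the difference stays constant for as long as the rows are adjacent and changes as soon as a row with the other tag intervenes, so (tag, seqnum - seqnum_t) labels each run.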

If you want each sequence individually, just add where tag = 1 and having count(*) >= 3 to this query.
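
A sketch of what that filtered version might look like, again run through Python's sqlite3 on made-up rows. The column alias first_occurrence and the final order by are additions for readability, and `date(timestamp)` replaces `timestamp::date` so it runs on SQLite.

```python
import sqlite3

# Hypothetical data: user 46 has runs of three and two tag = 1 events,
# user 122 has a single run of four.
rows = [
    (46, "2018-12-23 06:00:00", 1, 0),
    (46, "2018-12-23 06:10:00", 2, 1),
    (46, "2018-12-23 06:20:00", 3, 1),
    (46, "2018-12-23 06:30:00", 4, 1),
    (46, "2018-12-23 06:40:00", 5, 0),
    (46, "2018-12-23 06:50:00", 6, 1),
    (46, "2018-12-23 07:00:00", 7, 1),
    (122, "2018-12-28 06:00:00", 1, 1),
    (122, "2018-12-28 06:10:00", 2, 1),
    (122, "2018-12-28 06:20:00", 3, 1),
    (122, "2018-12-28 06:30:00", 4, 1),
]
conn = sqlite3.connect(":memory:")
conn.execute("create table t (user_id int, timestamp text, event_id int, tag int)")
conn.executemany("insert into t values (?, ?, ?, ?)", rows)

# Same grouping as above, restricted to tag = 1 runs of at least three rows.
query = """
select user_id, min(timestamp) as first_occurrence,
       min(event_id) as min_event_id,
       count(*) as num_consecutive
from (select t.*,
             row_number() over (partition by user_id, date(timestamp) order by timestamp) as seqnum,
             row_number() over (partition by user_id, date(timestamp), tag order by timestamp) as seqnum_t
      from t
     ) t
where tag = 1
group by user_id, date(timestamp), tag, (seqnum - seqnum_t)
having count(*) >= 3
order by user_id, first_occurrence
"""
result = conn.execute(query).fetchall()
print(result)
```

On this data the run of two for user 46 is dropped, leaving one row per qualifying run: a run of 3 starting at event 2 for user 46 and a run of 4 starting at event 1 for user 122.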

To convert this to your desired result set, use a subquery:

select user_id, min(min_event_id) as event_id, min(timestamp) as first_occurrence,
       sum(num_consecutive / 3) as consecutive_events  -- whole triplets per run, summed per day
from (select user_id, min(timestamp) as timestamp,
             count(*) as num_consecutive,
             min(event_id) as min_event_id
      from (select t.*,
                   row_number() over (partition by user_id, timestamp::date order by timestamp) as seqnum,
                   row_number() over (partition by user_id, timestamp::date, tag order by timestamp) as seqnum_t
            from t
           ) t
      where tag = 1
      group by user_id, timestamp::date, tag, (seqnum - seqnum_t)
     ) t
where num_consecutive >= 3
group by user_id, timestamp::date;
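
For the case raised in the question's update (one row per triplet inside a longer run), one possible extension is to number the rows within each run and cut the run into chunks of three. This is only a sketch, not part of the queries above: the names grp and pos are made up, the data is hypothetical, it runs on SQLite via Python with `date(timestamp)` in place of `timestamp::date`, and min(event_id) is the first event of the triplet only if event_id increases with the timestamp.

```python
import sqlite3

# Hypothetical data: one user, one day, seven tag = 1 events in a row
# (events 1..7 at 06:10..06:16), followed by one tag = 0 event.
rows = [(46, f"2018-12-23 06:{10 + i}:00", i + 1, 1) for i in range(7)]
rows.append((46, "2018-12-23 06:30:00", 8, 0))

conn = sqlite3.connect(":memory:")
conn.execute("create table t (user_id int, timestamp text, event_id int, tag int)")
conn.executemany("insert into t values (?, ?, ?, ?)", rows)

query = """
select user_id, min(timestamp) as first_occurrence, min(event_id) as first_event_id
from (select s.*,
             -- position of the row inside its run of tag = 1 events
             row_number() over (partition by user_id, date(timestamp), grp order by timestamp) as pos
      from (select t.*,
                   row_number() over (partition by user_id, date(timestamp) order by timestamp)
                   - row_number() over (partition by user_id, date(timestamp), tag order by timestamp) as grp
            from t
           ) s
      where tag = 1
     ) p
-- integer division puts positions 1-3, 4-6, 7-9, ... into separate chunks
group by user_id, date(timestamp), grp, (pos - 1) / 3
having count(*) = 3
order by user_id, first_occurrence
"""
result = conn.execute(query).fetchall()
print(result)
```

With seven consecutive events this returns two rows, starting at events 1 and 4. The seventh event does not complete a triplet and is dropped by the having clause.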

Upvotes: 1
