AJG
AJG

Reputation: 129

Counting the occurrences of a sequence of events spread over multiple rows

So the table holds a time sequential record of events like so:

+-------------------------+--------+--------+------------+------------+
|        Timestamp        |   id   | event  | variable 1 | variable 2 |
+-------------------------+--------+--------+------------+------------+
| 2019-05-17 00:00:00.000 | abc123 | event1 | variable1  | null       |
| 2019-05-17 00:00:10.000 | abc123 | event2 | null       | variable2  |
| 2019-05-17 00:00:15.000 | abc123 | event3 | null       | null       |
| 2019-05-17 00:05:00.000 | abc123 | event1 | variable1  | null       |
| 2019-05-17 00:05:10.000 | abc123 | event4 | null       | null       |
| 2019-05-17 00:05:15.000 | abc123 | event3 | null       | null       |
+-------------------------+--------+--------+------------+------------+

The requirement is to count the number of times a specific sequence of events occurs e.g. event1 follows to event2 which follows to event3. So in the above example the code would return:

+--------+----------------+
|   id   | sequence_count |
+--------+----------------+
| abc123 |              1 |
+--------+----------------+

The event1 -> event2 ->event3 sequence occurred once in the data set for user abc123, event1 -> event4 ->event3 sequence is not counted. The variable used to cut the count could also be switched to give the result:

+------------+----------------+
| variable 1 | sequence_count |
+------------+----------------+
| variable1  |              1 |
+------------+----------------+

For the purposes of this query, the timestamp variable should be treated as ordinal, not cardinal. I honestly have no idea where to start for this, if someone could help with the basis for this type of query I should be able to build it to extract the other insights I want from the data.

Upvotes: 4

Views: 964

Answers (3)

The Impaler
The Impaler

Reputation: 48770

You can use the LEAD() analytic function, as in:

with
x as (
  select
    event,
    lead(event) over(order by timestamp) as next_event,
    lead(event, 2) over(order by timestamp) as next_next_event
  from t
)
select count(*)
from x
where event = 'event1'
  and next_event = 'event2'
  and next_next_event = 'event3'

Added:

I'm not quite sure about the additional question you asked in the comments, but it seems to me you want to group by initiating variable. If that's the case, you could do:

with
x as (
  select
    event,
    variable_1,
    lead(event) over(order by timestamp) as next_event,
    lead(event, 2) over(order by timestamp) as next_next_event
  from t
)
select variable_1, count(*)
from x
where event = 'event1'
  and next_event = 'event2'
  and next_next_event = 'event3'
group by variable_1

Upvotes: 4

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Below is for BigQuery Standard SQL

#standardSQL
SELECT id, 
  ARRAY_LENGTH(
    REGEXP_EXTRACT_ALL(
      CONCAT(',', STRING_AGG(event ORDER BY Timestamp)), 
      ',event1,event2,event3')
  ) AS sequence_count
FROM `project.dataset.table`
GROUP BY id

Upvotes: 0

PeterHe
PeterHe

Reputation: 2766

I do not know bigquery, but the following is just some ideas. You should know the start event, e.g. 'event1'.

WITH cteSt (rid,id,timestamp)
AS
(
  -- Get all the timestamp for the start event
  SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp), id, timestamp 
  FROM dataset 
  WHERE event='event1' -- start event
),
cteRange(id,start_ts,end_ts)
AS
(
  -- get previous time stamp as end ts for comparing
  SELECT s.id,s.timestamp,COALESCE(e.timestamp,current_ts)
  FROM cteSt s
  LEFT JOIN cteSt e
  ON s.id=e.id
  AND s.rid+1=e.rid
),
cte_Events(id, start_ts, event_sequence)
AS
(
  -- event sequence order by ts
  SELECT r.id,r.start_ts, GROUP_CONCAT(d.event ORDER BY d.timestamp SEPARATOR ',')
  FROM cteRange r
  INNER JOIN dataset d
  ON r.id=d.id
  AND d.timestamp BETWEEN r.start_ts AND r.end_ts
  GROUP BY r.id,r.start_ts
)
-- get the occurrences for each event sequence
SELECT id,event_sequence,COUNT(*) AS occurrences
FROM cte_Events
WHERE event_sequence='YourSequence' -- or get all sequence count without where

Upvotes: 0

Related Questions