Reputation: 129
So the table holds a time sequential record of events like so:
+-------------------------+--------+--------+------------+------------+
| Timestamp | id | event | variable 1 | variable 2 |
+-------------------------+--------+--------+------------+------------+
| 2019-05-17 00:00:00.000 | abc123 | event1 | variable1 | null |
| 2019-05-17 00:00:10.000 | abc123 | event2 | null | variable2 |
| 2019-05-17 00:00:15.000 | abc123 | event3 | null | null |
| 2019-05-17 00:05:00.000 | abc123 | event1 | variable1 | null |
| 2019-05-17 00:05:10.000 | abc123 | event4 | null | null |
| 2019-05-17 00:05:15.000 | abc123 | event3 | null | null |
+-------------------------+--------+--------+------------+------------+
The requirement is to count the number of times a specific sequence of events occurs e.g. event1
follows to event2
which follows to event3
. So in the above example the code would return:
+--------+----------------+
| id | sequence_count |
+--------+----------------+
| abc123 | 1 |
+--------+----------------+
The event1
-> event2
->event3
sequence occurred once in the data set for user abc123
, event1
-> event4
->event3
sequence is not counted. The variable used to cut the count could also be switched to give the result:
+------------+----------------+
| variable 1 | sequence_count |
+------------+----------------+
| variable1 | 1 |
+------------+----------------+
For the purposes of this query, the timestamp variable should be treated as ordinal, not cardinal. I honestly have no idea where to start for this, if someone could help with the basis for this type of query I should be able to build it to extract the other insights I want from the data.
Upvotes: 4
Views: 964
Reputation: 48770
You can use the LEAD()
analytic function, as in:
with
x as (
select
event,
lead(event) over(order by timestamp) as next_event,
lead(event, 2) over(order by timestamp) as next_next_event
from t
)
select count(*)
from x
where event = 'event1'
and next_event = 'event2'
and next_next_event = 'event3'
Added:
I'm not quite sure about the additional question you asked in the comments, but it seems to me you want to group by initiating variable. If that's the case, you could do:
with
x as (
select
event,
variable_1,
lead(event) over(order by timestamp) as next_event,
lead(event, 2) over(order by timestamp) as next_next_event
from t
)
select variable_1, count(*)
from x
where event = 'event1'
and next_event = 'event2'
and next_next_event = 'event3'
group by variable_1
Upvotes: 4
Reputation: 172944
Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
ARRAY_LENGTH(
REGEXP_EXTRACT_ALL(
CONCAT(',', STRING_AGG(event ORDER BY Timestamp)),
',event1,event2,event3')
) AS sequence_count
FROM `project.dataset.table`
GROUP BY id
Upvotes: 0
Reputation: 2766
I do not know bigquery, but the following is just some ideas. You should know the start event, e.g. 'event1'.
WITH cteSt (rid,id,timestamp)
AS
(
-- Get all the timestamp for the start event
SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp), id, timestamp
FROM dataset
WHERE event='event1' -- start event
),
cteRange(id,start_ts,end_ts)
AS
(
-- get previous time stamp as end ts for comparing
SELECT s.id,s.timestamp,COALESCE(e.timestamp,current_ts)
FROM cteSt s
LEFT JOIN cteSt e
ON s.id=e.id
AND s.rid+1=e.rid
),
cte_Events(id, start_ts, event_sequence)
AS
(
-- event sequence order by ts
SELECT r.id,r.start_ts, GROUP_CONCAT(d.event ORDER BY d.timestamp SEPARATOR ',')
FROM cteRange r
INNER JOIN dataset d
ON r.id=d.id
AND d.timestamp BETWEEN r.start_ts AND r.end_ts
GROUP BY r.id,r.start_ts
)
-- get the occurrences for each event sequence
SELECT id,event_sequence,COUNT(*) AS occurrences
FROM cte_Events
WHERE event_sequence='YourSequence' -- or get all sequence count without where
Upvotes: 0