Gaurav Ghongde
Gaurav Ghongde

Reputation: 369

How do I optimize UNNEST in BigQuery for user journey analytics use case to effectively scale with billions of rows from clickstream events data

I want to create a feature similar to sequential segmentation provided by Adobe Analytics (Ref) I am using an external BQ table created on top of clickstream data stored in GCS location in the format of parquet, partitioned on date_key and event_name.

Data schema: date_key, user_id, session_id, event_name

My use case could have been easily solved with the use of MATCH_RECOGNIZE; however, since I am limited to using BigQuery for certain reasons and MATCH_RECOGNIZE is not supported in BigQuery, it has become challenging to find a solution.

I was trying UNNEST, but it is not scaling for larger data Here is the sample query I am using for segment: A then B then C then D, with some operator based conditions in between.

SELECT
  session_id,
  ROW_NUMBER() OVER (PARTITION BY session_id) as hit_counter,
FROM `table`
WHERE
  date_key >= 20241201 and date_key <= 20241231
  AND event_name IN ('A', 'B', 'C', 'D')
),
event_mapping AS (
 SELECT
   session_id,
   ARRAY_AGG(STRUCT(event_name, hit_counter)) AS event_sequence
 FROM filtered_ds
 GROUP BY session_id
),
combinations AS (
 SELECT
   session_id
 FROM event_mapping,
 UNNEST(event_sequence) a,
 UNNEST(event_sequence) b,
 UNNEST(event_sequence) c,
 UNNEST(event_sequence) d,
 UNNEST(event_sequence) e
 WHERE
   a.event_name = 'A' AND
   b.event_name = 'B' AND
   c.event_name = 'C' AND
   d.event_name = 'D' AND
   e.event_name = 'E' AND
   b.hit_counter > a.hit_counter 
   c.hit_counter - b.hit_counter BETWEEN 0 AND 10 AND
   d.hit_counter - c.hit_counter > 15 AND
   e.hit_counter > d.hit_counter
)
SELECT COUNT(DISTINCT session_id)
FROM combinations;```

Upvotes: 0

Views: 20

Answers (0)

Related Questions