Reputation: 1255
In my application, an event is generated for every action a user performs, with data in the format below -
user_id | step_num | event_timestamp
The order in which these events are streamed is not strictly guaranteed, i.e. we might have user 2's data streamed in before all of user 1's events have been captured.
Sample Data
u1, 1, 2020-05-30 00:00:01
u1, 2, 2020-05-30 00:00:02
u2, 1, 2020-05-30 00:00:02
u3, 1, 2020-05-30 00:00:02
u1, 3, 2020-05-30 00:00:03
....
How would you implement a streaming solution to calculate the average time taken for each step? We can assume the total time a user takes for a step to be (time_stamp_at_step_1 - time_stamp_at_step_0), i.e. the difference between the timestamps of consecutive steps. For the sample above, u1 takes 1 second for step 1 (00:00:02 - 00:00:01) and 1 second for step 2 (00:00:03 - 00:00:02).
I am using Spark Streaming to build the solution but could not find anything that computes statistics across two events grouped by an attribute (user_id in my case). I would love to know if there are any available solutions in other streaming technologies like Flink, Kafka, etc.
Upvotes: 0
Views: 404
Reputation: 43454
This could be done in various ways with Flink, but one approach would be to use Flink SQL with MATCH_RECOGNIZE:
SELECT step_num, AVG(seconds)
FROM events
MATCH_RECOGNIZE (
    PARTITION BY userId
    ORDER BY eventTime
    MEASURES
        this_step.userId AS id,
        this_step.step AS step_num,
        TIMESTAMPDIFF(SECOND, this_step.eventTime, next_step.eventTime) AS seconds
    AFTER MATCH SKIP TO LAST next_step
    PATTERN (this_step next_step)
    DEFINE
        this_step AS TRUE,
        next_step AS next_step.step = this_step.step + 1
)
GROUP BY step_num
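In case it helps, here is a minimal PyFlink sketch of how the query above could be wired up. The table and column names (events, userId, step, eventTime) match the query; the Kafka topic, connector settings, and the 5-second watermark interval are assumptions for illustration. The watermark on eventTime is what lets MATCH_RECOGNIZE sort each user's events by time despite the mildly out-of-order arrival described in the question.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register the source table. The watermark bounds how out-of-order events
# may arrive; the Kafka connector settings below are placeholders.
t_env.execute_sql("""
    CREATE TABLE events (
        userId STRING,
        step INT,
        eventTime TIMESTAMP(3),
        WATERMARK FOR eventTime AS eventTime - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'user-events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'csv'
    )
""")

# Run the MATCH_RECOGNIZE query from above and print the rolling averages.
t_env.sql_query("""
    SELECT step_num, AVG(seconds) AS avg_seconds
    FROM events
    MATCH_RECOGNIZE (
        PARTITION BY userId
        ORDER BY eventTime
        MEASURES
            this_step.userId AS id,
            this_step.step AS step_num,
            TIMESTAMPDIFF(SECOND, this_step.eventTime, next_step.eventTime) AS seconds
        AFTER MATCH SKIP TO LAST next_step
        PATTERN (this_step next_step)
        DEFINE
            this_step AS TRUE,
            next_step AS next_step.step = this_step.step + 1
    )
    GROUP BY step_num
""").execute().print()
Note that AFTER MATCH SKIP TO LAST next_step resumes matching at the row that closed the previous match, so each event serves as both the end of one step interval and the start of the next, and every consecutive pair of steps contributes one duration measurement.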
Upvotes: 1