Abhi
Abhi

Reputation: 1255

Calculate Time Difference between events in streaming datasets

In my application, events are generated for every action that user performs, and events are generated with data in the below format -

user_id | step_num | event_timestamp

Order in which these events are streamed are not completely strict i.e. we might have user 2 data streamed in before all events of user 1 are captured.

Sample Data

u1, 1, 2020-05-30 00:00:01

u1, 2, 2020-05-30 00:00:02

u2, 1, 2020-05-30 00:00:02

u3, 1, 2020-05-30 00:00:02

u1, 3, 2020-05-30 00:00:03

....

How would you implement any streaming solution to calculate average time taken for each step. We can assume total time taken by a user for each step to be (time_stamp_at_step_1 - time_stamp_at_step_0).

I am using Spark Streaming to build the solution but could not find anything to compute statistics taking data from 2 event that are grouped by an attribute(user_id in my case). Would love to know if there are any available solutions in other streaming tech like Flink, Kafka etc.

Upvotes: 0

Views: 404

Answers (1)

David Anderson
David Anderson

Reputation: 43454

This could be done in various ways with Flink, but one approach would be to use Flink SQL with MATCH_RECOGNIZE:

SELECT step_num, AVG(seconds)
FROM events
MATCH_RECOGNIZE (
    PARTITION BY userId
    ORDER BY eventTime
    MEASURES
        this_step.userId as id
        this_step.step AS step_num
        TIMESTAMPDIFF(SECOND, this_step.eventTime, next_step.eventTime) AS seconds
    AFTER MATCH SKIP TO LAST next_step
    PATTERN (this_step next_step)
    DEFINE
        this_step AS TRUE,
        next_step AS next_step.step = this_step.step + 1
)
GROUP BY step_num

Upvotes: 1

Related Questions