KaraiKare
KaraiKare

Reputation: 169

Find time difference between non-consecutive rows depending on column values in Athena

I have a dataset with conversation data between a User and a Customer Service. I need to find the response time (time difference) between after a user sent a message and first time the Customer Service responded to the message

enter image description here

What I struggle with is a user can send multiple consecutive message before the Customer Service can reply. If a user sent multiple consecutive message, I need to take the time of the last row, but if the customer service sent multiple consecutive message, I need to take the time of the first row

My expected result is as below:

enter image description here

So, from the 10 message_id in the base table, there will be 3 rows:

I've tried using lead function as follows but the result is not what I expected

 select sender,
        start_message,
        start_reply,
        date_diff('second',start_message,start_reply)/60.0 as response_time
from (
        select message_id,
                created_at as start_message,
                  lead(created_at) over(partition by sender  order by created_at) as start_reply,
                  sender_type
          from base_table
     ) t 
 order by 2 asc

I'm using athena, any help is greatly appreciated, thank you

Upvotes: 0

Views: 122

Answers (2)

Equinox
Equinox

Reputation: 6758

You can shift all the columns by 1 and select all rows where sender is user and shifted sender is cs. This will always give you last message of user and first message of support. provided data is ordered

Select message_id, cs_message_id,extract(seconds from cs_created_at - created_at) as response_time from 
  (
  Select *,
  lead(sender, 1, sender) over (order by created_at) as cs_sender,
  lead(message_id, 1, message_id) over (order by created_at) as cs_message_id,
  lead(created_at, 1, created_at) over (order by created_at) as cs_created_at
  from tbl_name
  )
where cs_sender = 'cs' and sender = 'user';

Output -

message_id  cs_message_id   response_time
3   4   8.000000
7   8   7.000000
9   10  13.000000

Upvotes: 1

ahmed
ahmed

Reputation: 9201

One approach is to use a flag that is set to one whenever the sender is 'user' and the previous sender is 'cs', then use a running sum on that flag to group every two pairs of consecutive (user, cs) as a single group.

with t as
(
  select *,
    case when 
      sender ='user' and lag(sender, 1, sender) over (order by created_at) = 'cs' 
      then 1 else 0 
    end as flag
 from tbl_name
),
grps as
(
  select *, 
    sum(flag) over (order by created_at) as grp
  from t
)
select max(case when sender='user' then message_id end) as message_id_user,
       min(case when sender='cs' then message_id end) as message_id_cs, 
       date_diff('second', max(case when sender='user' then created_at end), min(case when sender='cs' then created_at end)) as response_time
from grps
group by grp
order by grp

Upvotes: 1

Related Questions