Ayradyss
Ayradyss

Reputation: 11

Pandas, check if timestamp value is within x minutes of previous timestamp

I have a dataframe that looks like the below with a user id, timestamp and a song name. The timestamp is when the user has started playing the song. A session is defined where each song is started within 20 minutes of the previous song’s start time. I need to create a list of the top 10 longest sessions, with the following information about each session: userid, timestamp of first and last songs in the session, and the list of songs played in the session provided in order of play. Could you please help?

         user       timestamp          song
0        user_000001  05-05-09 12:08   The Start of Something
1        user_000001  04-05-09 14:54   My Sharona
2        user_000001  04-05-09 14:52   Caught by the river
3        user_000001  04-05-09 14:42   Swim
19       user_000001  03-05-09 15:56   Cover me
20       user_000001  03-05-09 15:50   Oh Holy Night
1048550  user_000050   25-01-07 8:51   I Hung My Head
1048551  user_000050   25-01-07 8:48   Slider
1048552  user_000050  24-01-07 22:57   Joy
1048553  user_000050  24-01-07 22:53   Crazy Eights
1048554  user_000050  24-01-07 22:48   Steady State
1048555  user_000050  24-01-07 22:42   Maple Leaves (7" Version)

Upvotes: 0

Views: 1045

Answers (2)

asongtoruin
asongtoruin

Reputation: 10359

Without altering the order of your data, we can do the following:

import pandas as pd
from io import StringIO

data = StringIO('''id,user,timestamp,song
0,user_000001,05-05-09 12:08,The Start of Something
1,user_000001,04-05-09 14:54,My Sharona
2,user_000001,04-05-09 14:52,Caught by the river
3,user_000001,04-05-09 14:42,Swim
19,user_000001,03-05-09 15:56,Cover me
20,user_000001,03-05-09 15:50,Oh Holy Night
1048550,user_000050, 25-01-07 8:51,I Hung My Head
1048551,user_000050, 25-01-07 8:48,Slider
1048552,user_000050,24-01-07 22:57,Joy
1048553,user_000050,24-01-07 22:53,Crazy Eights
1048554,user_000050,24-01-07 22:48,Steady State
1048555,user_000050,24-01-07 22:42,Maple Leaves (7" Version)''')

def time_elapsed(grp, session_length):
    grp['MinsElapsed'] = (grp['timestamp'] - grp['timestamp'].shift(-1)) / pd.Timedelta(minutes=1)
    grp['Session'] = (grp['MinsElapsed'] > session_length)[::-1].astype(int).cumsum()[::-1]
    return grp


df = pd.read_csv(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])

df = df.groupby('user').apply(time_elapsed, session_length=20)

print(df)

We group by user, and work out the time difference between the row below (.shift(-1)) in minutes. We then check if this column returns a value larger than the session length, convert that to an integer and apply a cumulative sum. As the times are in descending order, to get this to work properly we have to reverse the entire column before doing the cumulative sum, and then reset it afterwards.

This gives us:

         id         user           timestamp                       song  MinsElapsed  Session 
0         0  user_000001 2009-05-05 12:08:00     The Start of Something      43034.0        2 
1         1  user_000001 2009-04-05 14:54:00                 My Sharona          2.0        1 
2         2  user_000001 2009-04-05 14:52:00        Caught by the river         10.0        1 
3         3  user_000001 2009-04-05 14:42:00                       Swim      44566.0        1 
4        19  user_000001 2009-03-05 15:56:00                   Cover me          6.0        0 
5        20  user_000001 2009-03-05 15:50:00              Oh Holy Night          NaN        0 
6   1048550  user_000050 2007-01-25 08:51:00             I Hung My Head          3.0        1 
7   1048551  user_000050 2007-01-25 08:48:00                     Slider        591.0        1 
8   1048552  user_000050 2007-01-24 22:57:00                        Joy          4.0        0 
9   1048553  user_000050 2007-01-24 22:53:00               Crazy Eights          5.0        0 
10  1048554  user_000050 2007-01-24 22:48:00               Steady State          6.0        0 
11  1048555  user_000050 2007-01-24 22:42:00  Maple Leaves (7" Version)          NaN        0 

EDIT:

To get the first and last times on songs in the session and the length of the session, we can do the following:

session_length = df.groupby(['user', 'Session'])['timestamp'] \
                   .agg(['min', 'max']) \
                   .reset_index()

session_length['Length (mins)'] = (session_length['max'] -session_length['min']) / pd.Timedelta(minutes=1)

Which gives us:

          user  Session                 min                 max  Length (mins)
0  user_000001        0 2009-03-05 15:50:00 2009-03-05 15:56:00            6.0
1  user_000001        1 2009-04-05 14:42:00 2009-04-05 14:54:00           12.0
2  user_000001        2 2009-05-05 12:08:00 2009-05-05 12:08:00            0.0
3  user_000050        0 2007-01-24 22:42:00 2007-01-24 22:57:00           15.0
4  user_000050        1 2007-01-25 08:48:00 2007-01-25 08:51:00            3.0

Upvotes: 1

zipa
zipa

Reputation: 27879

This works, but I've reordered the data per user and timestamp ascending:

df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d-%m-%y %H:%M')
df=df.sort_values(['user', 'timestamp'])
session = 1000

def counter(clause):
    global session
    if not clause:
        session += 1
    return session


df[['u', 't']] = df[['user', 'timestamp']].shift(-1).ffill()

df['session_id'] = df.apply(lambda x: counter(all([[x['user']==x['u']], (x['timestamp']-x['t']) >= -pd.Timedelta(minutes=20)])), axis=1)

Upvotes: 0

Related Questions