Reputation: 11
I have a dataframe that looks like the below with a user id, timestamp and a song name. The timestamp is when the user has started playing the song. A session is defined where each song is started within 20 minutes of the previous song’s start time. I need to create a list of the top 10 longest sessions, with the following information about each session: userid, timestamp of first and last songs in the session, and the list of songs played in the session provided in order of play. Could you please help?
user timestamp song
0 user_000001 05-05-09 12:08 The Start of Something
1 user_000001 04-05-09 14:54 My Sharona
2 user_000001 04-05-09 14:52 Caught by the river
3 user_000001 04-05-09 14:42 Swim
19 user_000001 03-05-09 15:56 Cover me
20 user_000001 03-05-09 15:50 Oh Holy Night
1048550 user_000050 25-01-07 8:51 I Hung My Head
1048551 user_000050 25-01-07 8:48 Slider
1048552 user_000050 24-01-07 22:57 Joy
1048553 user_000050 24-01-07 22:53 Crazy Eights
1048554 user_000050 24-01-07 22:48 Steady State
1048555 user_000050 24-01-07 22:42 Maple Leaves (7" Version)
Upvotes: 0
Views: 1045
Reputation: 10359
Without altering the order of your data, we can do the following:
import pandas as pd
from io import StringIO
data = StringIO('''id,user,timestamp,song
0,user_000001,05-05-09 12:08,The Start of Something
1,user_000001,04-05-09 14:54,My Sharona
2,user_000001,04-05-09 14:52,Caught by the river
3,user_000001,04-05-09 14:42,Swim
19,user_000001,03-05-09 15:56,Cover me
20,user_000001,03-05-09 15:50,Oh Holy Night
1048550,user_000050, 25-01-07 8:51,I Hung My Head
1048551,user_000050, 25-01-07 8:48,Slider
1048552,user_000050,24-01-07 22:57,Joy
1048553,user_000050,24-01-07 22:53,Crazy Eights
1048554,user_000050,24-01-07 22:48,Steady State
1048555,user_000050,24-01-07 22:42,Maple Leaves (7" Version)''')
def time_elapsed(grp, session_length):
grp['MinsElapsed'] = (grp['timestamp'] - grp['timestamp'].shift(-1)) / pd.Timedelta(minutes=1)
grp['Session'] = (grp['MinsElapsed'] > session_length)[::-1].astype(int).cumsum()[::-1]
return grp
df = pd.read_csv(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.groupby('user').apply(time_elapsed, session_length=20)
print(df)
We group by user, and work out the time difference between the row below (.shift(-1)
) in minutes. We then check if this column returns a value larger than the session length, convert that to an integer and apply a cumulative sum. As the times are in descending order, to get this to work properly we have to reverse the entire column before doing the cumulative sum, and then reset it afterwards.
This gives us:
id user timestamp song MinsElapsed Session
0 0 user_000001 2009-05-05 12:08:00 The Start of Something 43034.0 2
1 1 user_000001 2009-04-05 14:54:00 My Sharona 2.0 1
2 2 user_000001 2009-04-05 14:52:00 Caught by the river 10.0 1
3 3 user_000001 2009-04-05 14:42:00 Swim 44566.0 1
4 19 user_000001 2009-03-05 15:56:00 Cover me 6.0 0
5 20 user_000001 2009-03-05 15:50:00 Oh Holy Night NaN 0
6 1048550 user_000050 2007-01-25 08:51:00 I Hung My Head 3.0 1
7 1048551 user_000050 2007-01-25 08:48:00 Slider 591.0 1
8 1048552 user_000050 2007-01-24 22:57:00 Joy 4.0 0
9 1048553 user_000050 2007-01-24 22:53:00 Crazy Eights 5.0 0
10 1048554 user_000050 2007-01-24 22:48:00 Steady State 6.0 0
11 1048555 user_000050 2007-01-24 22:42:00 Maple Leaves (7" Version) NaN 0
EDIT:
To get the first and last times on songs in the session and the length of the session, we can do the following:
session_length = df.groupby(['user', 'Session'])['timestamp'] \
.agg(['min', 'max']) \
.reset_index()
session_length['Length (mins)'] = (session_length['max'] -session_length['min']) / pd.Timedelta(minutes=1)
Which gives us:
user Session min max Length (mins)
0 user_000001 0 2009-03-05 15:50:00 2009-03-05 15:56:00 6.0
1 user_000001 1 2009-04-05 14:42:00 2009-04-05 14:54:00 12.0
2 user_000001 2 2009-05-05 12:08:00 2009-05-05 12:08:00 0.0
3 user_000050 0 2007-01-24 22:42:00 2007-01-24 22:57:00 15.0
4 user_000050 1 2007-01-25 08:48:00 2007-01-25 08:51:00 3.0
Upvotes: 1
Reputation: 27879
This works, but I've reordered the data per user and timestamp ascending:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d-%m-%y %H:%M')
df=df.sort_values(['user', 'timestamp'])
session = 1000
def counter(clause):
global session
if not clause:
session += 1
return session
df[['u', 't']] = df[['user', 'timestamp']].shift(-1).ffill()
df['session_id'] = df.apply(lambda x: counter(all([[x['user']==x['u']], (x['timestamp']-x['t']) >= -pd.Timedelta(minutes=20)])), axis=1)
Upvotes: 0