I have a pandas dataframe which looks like this:
userID timestamp other_data
1 2017-06-19 17:14:00.000 foo
1 2017-06-19 19:16:00.000 bar
1 2017-06-19 23:26:00.000 ter
1 2017-06-20 01:16:00.000 lol
2 2017-06-20 12:00:00.000 ter
2 2017-06-20 13:15:00.000 foo
2 2017-06-20 17:15:00.000 bar
I'd like to add two columns, time_since_previous_point and time_until_next_point, computed only between points belonging to the same user. I don't really care about the unit/format for now (as long as I can switch easily between them):
userID  timestamp                time_since_previous  time_until_next  other_data
1       2017-06-19 17:14:00.000                       02:02:00.000     foo
1       2017-06-19 19:16:00.000  02:02:00.000         04:10:00.000     bar
1       2017-06-19 23:26:00.000  04:10:00.000         01:50:00.000     ter
1       2017-06-20 01:16:00.000  01:50:00.000                          lol
2       2017-06-20 12:00:00.000                       01:15:00.000     ter
2       2017-06-20 13:15:00.000  01:15:00.000         04:00:00.000     foo
2       2017-06-20 17:15:00.000  04:00:00.000                          bar
How can I do that? (The empty cells can be empty, NaN, or None, whichever seems best to you, given that my next step is to compute descriptive statistics on time_since_previous and time_until_next.)
Please note that I have shown userID as a single column here, but in reality a user is uniquely identified only by a combination of columns (country + userID).
I think what you are missing is the pandas shift function, together with this answer: "Pandas: Shift down values by one row within a group".
Combining the two, you can do it like this:
from io import StringIO
import pandas as pd
csv = """userID,timestamp,other_data
1,2017-06-19 17:14:00.000,foo
1,2017-06-19 19:16:00.000,bar
1,2017-06-19 23:26:00.000,ter
1,2017-06-20 01:16:00.000,lol
2,2017-06-20 12:00:00.000,ter
2,2017-06-20 13:15:00.000,foo
2,2017-06-20 17:15:00.000,bar
"""
df = pd.read_csv(StringIO(csv))
This gives:
userID timestamp other_data
0 1 2017-06-19 17:14:00.000 foo
1 1 2017-06-19 19:16:00.000 bar
2 1 2017-06-19 23:26:00.000 ter
3 1 2017-06-20 01:16:00.000 lol
4 2 2017-06-20 12:00:00.000 ter
5 2 2017-06-20 13:15:00.000 foo
6 2 2017-06-20 17:15:00.000 bar
First, you need to convert timestamp to a datetime column:
df['timestamp'] = pd.to_datetime(df.timestamp)
Then combine the groupby and shift methods:
df['time_since_previous'] = df['timestamp'] - df.groupby('userID')['timestamp'].shift(1)
df['time_until_next'] = df.groupby('userID')['timestamp'].shift(-1) - df['timestamp']
This ultimately gives you what you want:
userID timestamp other_data time_since_previous time_until_next
0 1 2017-06-19 17:14:00 foo NaT 02:02:00
1 1 2017-06-19 19:16:00 bar 02:02:00 04:10:00
2 1 2017-06-19 23:26:00 ter 04:10:00 01:50:00
3 1 2017-06-20 01:16:00 lol 01:50:00 NaT
4 2 2017-06-20 12:00:00 ter NaT 01:15:00
5 2 2017-06-20 13:15:00 foo 01:15:00 04:00:00
6 2 2017-06-20 17:15:00 bar 04:00:00 NaT
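The question mentions that a user is really identified by country + userID. The same approach should carry over by passing a list of columns to groupby. Below is a minimal sketch under that assumption: the country values and the sample rows are made up for illustration. Since shift works on row order, it also sorts by timestamp within each user first, in case the real data is not already ordered.

```python
import pandas as pd

# Hypothetical sample: same userID in two countries, rows deliberately out of order.
df = pd.DataFrame({
    "country": ["FR", "FR", "FR", "US", "US"],
    "userID": [1, 1, 1, 1, 1],
    "timestamp": pd.to_datetime([
        "2017-06-19 19:16:00",
        "2017-06-19 17:14:00",
        "2017-06-19 23:26:00",
        "2017-06-20 12:00:00",
        "2017-06-20 13:15:00",
    ]),
})

keys = ["country", "userID"]  # composite user identifier

# shift() follows row order, so order the rows chronologically within each user.
df = df.sort_values(keys + ["timestamp"]).reset_index(drop=True)

grouped = df.groupby(keys)["timestamp"]
df["time_since_previous"] = df["timestamp"] - grouped.shift(1)
df["time_until_next"] = grouped.shift(-1) - df["timestamp"]

print(df)
```

The first row of each (country, userID) group has NaT in time_since_previous, and the last row has NaT in time_until_next.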
The only thing left for you is to deal with the NaT values.
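For descriptive statistics, the NaT values may not need special treatment: pandas reductions skip missing values by default. The sketch below uses made-up timestamps for a single user. It shows one way to summarise the gaps and to switch units, by converting the timedeltas to minutes with dt.total_seconds(). The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical timestamps for a single user.
ts = pd.Series(pd.to_datetime([
    "2017-06-19 17:14:00",
    "2017-06-19 19:16:00",
    "2017-06-19 23:26:00",
    "2017-06-20 01:16:00",
]))

deltas = ts.diff()  # timedelta64 series; the first value is NaT

# Reductions skip NaT by default (skipna=True).
mean_delta = deltas.mean()

# Convert to a plain float unit; NaT becomes NaN, which describe() also ignores.
minutes = deltas.dt.total_seconds() / 60
summary = minutes.describe()

print(mean_delta)
print(summary)
```

If the missing rows get in the way of a later step, dropna() on the relevant column is another option. Filling them with zero would bias the statistics, so it is probably best avoided here.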