Reputation: 171
I am trying to do a date calculation counting days passing between events in a non-date column in pandas.
I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'date':[
'01.01.2020','02.01.2020','03.01.2020','10.01.2020',
'01.01.2020','04.02.2020','20.02.2020','21.02.2020',
'01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
'user_id':[1,1,1,1,2,2,2,2,3,3,3,3],
'other_val':[0,0,0,100,0,100,0,10,10,0,0,10],
'booly':[True, False, False, True,
True, False, False, True,
True, True, True, True]})
Now, I've been unable to figure out how to create a new column stating the number of days that passed between each True value in the 'booly' column, for each user. So for each row with a True in the 'booly' column, how many days is it until the next row with a True in the 'booly' column occurs, like so:
date user_id booly days_until_next_booly
01.01.2020 1 True 9
02.01.2020 1 False None
03.01.2020 1 False None
10.01.2020 1 True None
01.01.2020 2 True 51
04.02.2020 2 False None
20.02.2020 2 False None
21.01.2020 2 True None
01.02.2020 3 True 9
10.02.2020 3 True 10
20.02.2020 3 True 29
20.03.2020 3 True None
Upvotes: 1
Views: 87
Reputation: 586
(
df
# fist convert the date column to datetime format
.assign(date=lambda x: pd.to_datetime(x['date'], dayfirst=True))
# sort your dates
.sort_values('date')
# calculate the difference between subsequent dates
.assign(date_diff=lambda x: x['date'].diff(1).shift(-1))
# Groupby your booly column to calculate the cumulative days between True values
.assign(date_diff_cum=lambda x: x.groupby(x['booly'].cumsum())['date_diff'].transform('sum').where(x['booly'] == True))
)
Output:
date user_id other_val booly date_diff date_diff_cum
2020-01-01 2 0 True 1 days 9 days
2020-01-02 1 0 False 1 days NaT
2020-01-03 1 0 False 7 days NaT
2020-01-10 1 100 True 22 days 22 days
2020-02-01 1 0 True 0 days 0 days
2020-02-01 3 10 True 3 days 9 days
2020-02-04 2 10 False 6 days NaT
2020-02-10 3 0 True 10 days 10 days
2020-02-20 2 100 False 0 days NaT
2020-02-20 3 0 True 1 days 1 days
2020-02-21 2 0 True 28 days 28 days
2020-03-20 3 10 True NaT 0 days
Upvotes: 1
Reputation: 14113
# sample data
df = pd.DataFrame({'date':[
'01.01.2020','02.01.2020','03.01.2020','10.01.2020',
'01.01.2020','04.02.2020','20.02.2020','21.02.2020',
'01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
'user_id':[1,1,1,1,2,2,2,2,3,3,3,3],
'other_val':[0,0,0,100,0,100,0,10,10,0,0,10],
'booly':[True, False, False, True,
True, False, False, True,
True, True, True, True]})
# convert data to date time format
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# use loc with groupby to calculate the difference between True values
df.loc[df['booly'] == True, 'days_until_next_booly'] = df.loc[df['booly'] == True].groupby('user_id')['date'].diff().shift(-1)
date user_id other_val booly days_until_next_booly
0 2020-01-01 1 0 True 9 days
1 2020-01-02 1 0 False NaT
2 2020-01-03 1 0 False NaT
3 2020-01-10 1 100 True NaT
4 2020-01-01 2 0 True 51 days
5 2020-02-04 2 100 False NaT
6 2020-02-20 2 0 False NaT
7 2020-02-21 2 10 True NaT
8 2020-02-01 3 10 True 9 days
9 2020-02-10 3 0 True 10 days
10 2020-02-20 3 0 True 29 days
11 2020-03-20 3 10 True NaT
Upvotes: 1