Reputation: 303
My DF currently has just the first two columns, DATE and RESULT, and I want to create the third column, N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS:
DATE RESULT N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-12-26 23:13:43+00:00 1 0
2019-02-18 23:27:58+00:00 0 1
2019-02-28 15:02:33+00:00 0 0
2019-03-05 18:30:26+00:00 1 0
2019-05-21 14:54:52+00:00 1 0
2019-08-26 14:30:38+00:00 1 0
2019-09-19 15:51:01+00:00 1 1
2019-12-16 17:58:24+00:00 0 0
2021-02-23 03:50:33+00:00 0 0
2021-08-08 22:26:01+00:00 1 0
2021-09-01 18:04:46+00:00 0 1
For each row I want to check all the previous rows that are within 60 days of the current row and sum up how many of those rows have RESULT == 1. I can only think of a double for loop to solve this problem, which is not efficient. Is there a more efficient way to solve it?
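This is roughly the double loop I have in mind (a minimal sketch, assuming DATE is already a datetime column):

import pandas as pd

counts = []
for i in range(len(df)):
    date_i = df['DATE'].iloc[i]
    n = 0
    for j in range(i):  # only previous rows
        delta = date_i - df['DATE'].iloc[j]
        # count earlier rows within 60 days whose RESULT is 1
        if df['RESULT'].iloc[j] == 1 and pd.Timedelta(0) <= delta <= pd.Timedelta(days=60):
            n += 1
    counts.append(n)
df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = counts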
Edit 1: I made a mistake when first creating the N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS column: I was not considering RESULT == 1. I have now fixed it.
Edit 2: I thought this simple example would be enough to convey the problem. However, as it turns out, the best answer so far requires the dates to be sorted, and I actually can't sort them. Here is why:
My actual problem has some IDs in it, and I must solve this problem for each individual ID. My DF actually looks more like this:
DATE ID RESULT N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-01-24 22:02:36+00:00 104 1 0
2018-05-15 18:27:17+00:00 104 0 0
2019-05-15 22:58:06+00:00 104 1 0
2019-07-22 15:17:55+00:00 104 1 0
2020-01-03 20:27:52+00:00 104 1 0
2018-12-26 23:13:43+00:00 105 1 0
2019-02-18 23:27:58+00:00 105 0 1
2019-02-28 15:02:33+00:00 105 0 0
2019-03-05 18:30:26+00:00 105 1 0
2019-05-21 14:54:52+00:00 105 1 0
2019-08-26 14:30:38+00:00 105 1 0
2019-09-19 15:51:01+00:00 105 1 1
2019-12-16 17:58:24+00:00 105 0 0
2021-02-23 03:50:33+00:00 105 0 0
2021-08-08 22:26:01+00:00 105 1 0
2021-09-01 18:04:46+00:00 105 0 1
2019-01-12 21:24:23+00:00 106 0 0
2019-05-28 08:03:55+00:00 106 1 0
2019-09-17 02:56:47+00:00 106 0 0
2020-05-06 17:20:55+00:00 106 0 0
2021-01-07 13:14:41+00:00 106 0 0
So, if I set my DATE column as the index and then sort the index, I end up scrambling my ID column.
Upvotes: 1
Views: 82
Reputation: 143216
It seems @MethodGuy already described how to use rolling() while I was working on my solution, but I'm posting my version because it has something extra. I also get the same results as @MethodGuy, which differ from N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS, so I checked how many days lie between the first and last date in each rolling window. I'm not sure, but maybe it should be '61D' (with the last date dropped inside the function) to cover the past 60 days.
If you have DATE as the index, then you can use rolling('60D') to create a rolling window over only the last 60 days, and then use .sum(), .count(), etc. You can also use .apply(func) to run your own function, which can skip the current date:
def result(data):
    return data[:-1].sum()

df['result'] = df['RESULT'].rolling('60D').apply(result).astype(int)
Minimal working code which shows .sum(), .count(), and .apply() to calculate the sum without the current day. I also use .apply() to calculate the days between the first and last date in each rolling window:
text = '''DATE RESULT 60_DAYS
2018-12-26 23:13:43+00:00 1 0
2019-02-18 23:27:58+00:00 0 1
2019-02-28 15:02:33+00:00 0 1
2019-03-05 18:30:26+00:00 1 2
2019-05-21 14:54:52+00:00 1 0
2019-08-26 14:30:38+00:00 1 0
2019-09-19 15:51:01+00:00 1 1
2019-12-16 17:58:24+00:00 0 0
2021-02-23 03:50:33+00:00 0 0
2021-08-08 22:26:01+00:00 1 0
2021-09-01 18:04:46+00:00 0 1'''
import pandas as pd
import io

# parse the sample data; columns are separated by two or more spaces
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df.index = pd.to_datetime(df['DATE'])
del df['DATE']
print(df)

def result1(data):
    # sum of RESULT in the window, without the current (last) row
    data = data[:-1]
    return data.sum()

def result2(data):
    # count of RESULT == 1 in the window, without the current (last) row
    data = data[:-1]
    return len(data[data == 1])

def days(data):
    # days between the first and last date in the window
    return (data.index[-1] - data.index[0]).days

window = df['RESULT'].rolling('60D')

df['sum'] = window.sum().astype(int)
df['count'] = window.count().astype(int)
df['result1'] = window.apply(result1).astype(int)
df['result2'] = window.apply(result2).astype(int)
df['days'] = window.apply(days).astype(int)
print(df)
Result:
RESULT 60_DAYS sum count result1 result2 days
DATE
2018-12-26 23:13:43+00:00 1 0 1 1 0 0 0
2019-02-18 23:27:58+00:00 0 1 1 2 1 1 54
2019-02-28 15:02:33+00:00 0 1 0 2 0 0 9
2019-03-05 18:30:26+00:00 1 2 1 3 0 0 14
2019-05-21 14:54:52+00:00 1 0 1 1 0 0 0
2019-08-26 14:30:38+00:00 1 0 1 1 0 0 0
2019-09-19 15:51:01+00:00 1 1 2 2 1 1 24
2019-12-16 17:58:24+00:00 0 0 0 1 0 0 0
2021-02-23 03:50:33+00:00 0 0 0 1 0 0 0
2021-08-08 22:26:01+00:00 1 0 1 1 0 0 0
2021-09-01 18:04:46+00:00 0 1 1 2 1 1 23
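On the '61D' question: a possible alternative (a hedged sketch, assuming a pandas version where rolling() accepts the closed argument for offset windows) is to keep '60D' but close the window on the left, so the current row is excluded without a custom function:

# closed='left' drops the current row from each window
past_60 = df['RESULT'].rolling('60D', closed='left').sum()
# the first row has an empty window and comes back as NaN
df['result3'] = past_60.fillna(0).astype(int)

Up to edge cases (rows exactly 60 days apart, duplicate timestamps), this should match result1 above.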
Upvotes: 1
Reputation: 31
Assuming that 'DATE' is a DatetimeIndex, you can group by 'ID' and then use .rolling(), which now works for ragged datetimes:
df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df.groupby('ID')['RESULT'].rolling('60D').sum().astype(int).droplevel(0)
I'm using the original index to add the column here, which works, but I think a more robust solution would be to use 'ID' and 'DATE' to merge the original df with the df that contains the 60-day sums, so you can also try that. Also, I understand you don't want to include the current 'RESULT' itself, only the sum from previous ones. In that case, just subtract it:
df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] - df['RESULT']
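For reference, a minimal sketch of the merge-based variant mentioned above (the names sums and out are mine; it produces the same table as the direct assignment):

sums = (
    df.groupby('ID')['RESULT']
      .rolling('60D')
      .sum()
      .astype(int)
      .rename('N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS')
      .reset_index()  # ID and DATE come back as regular columns
)
out = df.reset_index().merge(sums, on=['ID', 'DATE'])
# drop the current row's own RESULT from the count
out['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] -= out['RESULT']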
Output:
DATE ID RESULT N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-01-24 22:02:36+00:00 104 1 0
2018-05-15 18:27:17+00:00 104 0 0
2019-05-15 22:58:06+00:00 104 1 0
2019-07-22 15:17:55+00:00 104 1 0
2020-01-03 20:27:52+00:00 104 1 0
2018-12-26 23:13:43+00:00 105 1 0
2019-02-18 23:27:58+00:00 105 0 1
2019-02-28 15:02:33+00:00 105 0 0
2019-03-05 18:30:26+00:00 105 1 0
2019-05-21 14:54:52+00:00 105 1 0
2019-08-26 14:30:38+00:00 105 1 0
2019-09-19 15:51:01+00:00 105 1 1
2019-12-16 17:58:24+00:00 105 0 0
2021-02-23 03:50:33+00:00 105 0 0
2021-08-08 22:26:01+00:00 105 1 0
2021-09-01 18:04:46+00:00 105 0 1
2019-01-12 21:24:23+00:00 106 0 0
2019-05-28 08:03:55+00:00 106 1 0
2019-09-17 02:56:47+00:00 106 0 0
2020-05-06 17:20:55+00:00 106 0 0
2021-01-07 13:14:41+00:00 106 0 0
Upvotes: 2