Matheus Schaly

Reputation: 303

Create new column based on how many rows, with condition based on another column, are within X days of current row date

My DF currently has just the first two columns, DATE and RESULT, and I want to create the third column, N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS:

DATE                        RESULT   N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-12-26 23:13:43+00:00   1        0
2019-02-18 23:27:58+00:00   0        1
2019-02-28 15:02:33+00:00   0        0
2019-03-05 18:30:26+00:00   1        0
2019-05-21 14:54:52+00:00   1        0
2019-08-26 14:30:38+00:00   1        0
2019-09-19 15:51:01+00:00   1        1
2019-12-16 17:58:24+00:00   0        0
2021-02-23 03:50:33+00:00   0        0
2021-08-08 22:26:01+00:00   1        0
2021-09-01 18:04:46+00:00   0        1

For each row I want to check all the previous rows that are within 60 days of the current row and sum up how many of those previous rows have RESULT == 1. I can only think of a double for loop to solve this problem, which is not efficient. Is there a more efficient way to solve this problem?
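The double for loop I mention would look roughly like this (a sketch on a small hypothetical subset of the table above, just to show the intended semantics):

```python
import pandas as pd

# Small subset of the table above, used only to illustrate the
# quadratic baseline I want to avoid.
df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-12-26 23:13:43+00:00",
        "2019-02-18 23:27:58+00:00",
        "2019-02-28 15:02:33+00:00",
        "2019-03-05 18:30:26+00:00",
    ]),
    "RESULT": [1, 0, 0, 1],
})

# For every row, count strictly earlier rows within 60 days
# whose RESULT == 1.
counts = []
for _, row in df.iterrows():
    start = row["DATE"] - pd.Timedelta(days=60)
    prev = df[(df["DATE"] >= start) & (df["DATE"] < row["DATE"])]
    counts.append(int((prev["RESULT"] == 1).sum()))
df["N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS"] = counts
```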

Edit 1: I made a mistake when first creating the N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS column: I was not considering the RESULT == 1 condition. It is now fixed.

Edit 2: I thought this simple example would be enough to describe the problem. However, as it turns out, the best answer so far requires the dates to be sorted, and I can't simply sort them. Here is why:

My actual problem has some IDs in it, and I must solve this problem for each individual ID. My DF actually looks more like this:

DATE                        ID   RESULT   N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-01-24 22:02:36+00:00   104  1        0
2018-05-15 18:27:17+00:00   104  0        0
2019-05-15 22:58:06+00:00   104  1        0
2019-07-22 15:17:55+00:00   104  1        0
2020-01-03 20:27:52+00:00   104  1        0
2018-12-26 23:13:43+00:00   105  1        0
2019-02-18 23:27:58+00:00   105  0        1
2019-02-28 15:02:33+00:00   105  0        0
2019-03-05 18:30:26+00:00   105  1        0
2019-05-21 14:54:52+00:00   105  1        0
2019-08-26 14:30:38+00:00   105  1        0
2019-09-19 15:51:01+00:00   105  1        1
2019-12-16 17:58:24+00:00   105  0        0
2021-02-23 03:50:33+00:00   105  0        0
2021-08-08 22:26:01+00:00   105  1        0
2021-09-01 18:04:46+00:00   105  0        1
2019-01-12 21:24:23+00:00   106  0        0
2019-05-28 08:03:55+00:00   106  1        0
2019-09-17 02:56:47+00:00   106  0        0
2020-05-06 17:20:55+00:00   106  0        0
2021-01-07 13:14:41+00:00   106  0        0

So, if I set my DATE column as the index and then sort the index, I end up messing up my ID column.
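(A small sketch, using a hypothetical subset of the data above: sorting by ['ID', 'DATE'] instead of sorting the index alone keeps each ID's rows together while making DATE monotonic within every ID.)

```python
import pandas as pd

# Hypothetical subset of the data above, with rows deliberately
# out of order inside each ID.
df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2019-05-15 22:58:06+00:00",
        "2018-01-24 22:02:36+00:00",
        "2019-02-18 23:27:58+00:00",
        "2018-12-26 23:13:43+00:00",
    ]),
    "ID": [104, 104, 105, 105],
    "RESULT": [1, 1, 0, 1],
})

df = df.sort_values(["ID", "DATE"])
# DATE is now sorted inside each ID group, so a per-ID rolling
# window can still be applied after setting DATE as the index.
```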

Upvotes: 1

Views: 82

Answers (2)

furas

Reputation: 143216

It seems @MethodGuy already described how to use rolling() while I was working on my solution, but I'm posting my version because I have something else to add.

I also get the same result as @MethodGuy, which is different from N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS, so I checked how many days are between the first and last date in each rolling window.

I'm not sure, but maybe it should be '61D' (minus the last date in the function) to get the past 60 days.


If you had DATE as the index, then you could use rolling('60D') to create a rolling window over only the last 60 days, and then use .sum(), .count(), etc. You can also use .apply(func) to run your own function, which can skip the current date:

def result(data):
    return data[:-1].sum()  # sum previous rows, skipping the current one

df['result'] = df['RESULT'].rolling('60D').apply(result).astype(int)
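(A possible shortcut, not part of the original answer: depending on the pandas version, time-based rolling windows also accept closed='left', which excludes the current timestamp without a custom function. Note it would also drop any other rows sharing the exact same timestamp, unlike data[:-1].)

```python
import pandas as pd

# sketch: closed='left' makes the window [t - 60D, t), excluding the
# current row; the empty first window yields NaN, hence fillna(0)
s = pd.Series(
    [1, 0, 1],
    index=pd.to_datetime(["2019-01-01", "2019-01-10", "2019-03-20"]),
)
prev_sum = s.rolling("60D", closed="left").sum().fillna(0).astype(int)
```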

Minimal working code which shows .sum(), .count(), and .apply() to calculate the sum without the current day. I also use .apply() to calculate the days between the first and last date in each rolling window:

text = '''DATE                        RESULT   60_DAYS
2018-12-26 23:13:43+00:00   1        0
2019-02-18 23:27:58+00:00   0        1
2019-02-28 15:02:33+00:00   0        1
2019-03-05 18:30:26+00:00   1        2
2019-05-21 14:54:52+00:00   1        0
2019-08-26 14:30:38+00:00   1        0
2019-09-19 15:51:01+00:00   1        1
2019-12-16 17:58:24+00:00   0        0
2021-02-23 03:50:33+00:00   0        0
2021-08-08 22:26:01+00:00   1        0
2021-09-01 18:04:46+00:00   0        1'''

import io
import pandas as pd

df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df.index = pd.to_datetime(df['DATE'])
del df['DATE']

print(df)

def result1(data):
    data = data[:-1]  # drop the current (last) row
    return data.sum()

def result2(data):
    data = data[:-1]  # drop the current (last) row
    return len(data[data == 1])

def days(data):
    # days between the first and last date in the window
    return (data.index[-1] - data.index[0]).days

window = df['RESULT'].rolling('60D')

df['sum']     = window.sum().astype(int)
df['count']   = window.count().astype(int)
df['result1'] = window.apply(result1).astype(int)
df['result2'] = window.apply(result2).astype(int)
df['days']    = window.apply(days).astype(int)

print(df)

Result:

                           RESULT  60_DAYS  sum  count  result1  result2  days
DATE                                                                          
2018-12-26 23:13:43+00:00       1        0    1      1        0        0     0
2019-02-18 23:27:58+00:00       0        1    1      2        1        1    54
2019-02-28 15:02:33+00:00       0        1    0      2        0        0     9
2019-03-05 18:30:26+00:00       1        2    1      3        0        0    14
2019-05-21 14:54:52+00:00       1        0    1      1        0        0     0
2019-08-26 14:30:38+00:00       1        0    1      1        0        0     0
2019-09-19 15:51:01+00:00       1        1    2      2        1        1    24
2019-12-16 17:58:24+00:00       0        0    0      1        0        0     0
2021-02-23 03:50:33+00:00       0        0    0      1        0        0     0
2021-08-08 22:26:01+00:00       1        0    1      1        0        0     0
2021-09-01 18:04:46+00:00       0        1    1      2        1        1    23

Upvotes: 1

MethodGuy

Reputation: 31

Assuming that 'DATE' is a DatetimeIndex, you can groupby 'ID' and then use .rolling(), which now works for ragged datetimes:

df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df.groupby('ID').rolling('60D').sum().astype('int').droplevel(0)

I'm using the original index to add the column here, which works, but I think a more robust solution would be to use 'ID' and 'DATE' to merge the original df with the df that contains the 60-day sums, so you can also try that. Also, I understand you don't want to include the current 'RESULT' itself, only the sum from previous ones. In that case, just subtract it:

df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] - df['RESULT']

Output:

DATE                        ID  RESULT  N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-01-24 22:02:36+00:00   104 1       0
2018-05-15 18:27:17+00:00   104 0       0
2019-05-15 22:58:06+00:00   104 1       0
2019-07-22 15:17:55+00:00   104 1       0
2020-01-03 20:27:52+00:00   104 1       0
2018-12-26 23:13:43+00:00   105 1       0
2019-02-18 23:27:58+00:00   105 0       1
2019-02-28 15:02:33+00:00   105 0       0
2019-03-05 18:30:26+00:00   105 1       0
2019-05-21 14:54:52+00:00   105 1       0
2019-08-26 14:30:38+00:00   105 1       0
2019-09-19 15:51:01+00:00   105 1       1
2019-12-16 17:58:24+00:00   105 0       0
2021-02-23 03:50:33+00:00   105 0       0
2021-08-08 22:26:01+00:00   105 1       0
2021-09-01 18:04:46+00:00   105 0       1
2019-01-12 21:24:23+00:00   106 0       0
2019-05-28 08:03:55+00:00   106 1       0
2019-09-17 02:56:47+00:00   106 0       0
2020-05-06 17:20:55+00:00   106 0       0
2021-01-07 13:14:41+00:00   106 0       0
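The merge variant mentioned above might look like this (a hypothetical sketch on a small single-ID sample, not the answerer's exact code):

```python
import pandas as pd

# Small single-ID sample (hypothetical subset of the question's data).
df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-12-26 23:13:43+00:00",
        "2019-02-18 23:27:58+00:00",
        "2019-02-28 15:02:33+00:00",
        "2019-03-05 18:30:26+00:00",
    ]),
    "ID": [105, 105, 105, 105],
    "RESULT": [1, 0, 0, 1],
}).set_index("DATE")

# Per-ID 60-day rolling sums as a separate frame with explicit keys.
sums = (
    df.groupby("ID")["RESULT"]
      .rolling("60D")
      .sum()
      .astype(int)
      .rename("N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS")
      .reset_index()
)

# Merge back on ID + DATE instead of relying on index alignment,
# then subtract the current row's own RESULT.
out = df.reset_index().merge(sums, on=["ID", "DATE"])
out["N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS"] -= out["RESULT"]
```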

Upvotes: 2
