corianne1234
corianne1234

Reputation: 724

pandas groupby with condition depending on same column

I have a dataframe with ID, date and number columns and would like to create a new column that takes the mean of all numbers for this specific ID BUT only includes the numbers in the mean where date is smaller than the date of this row. How would I do this?

df = (pd.DataFrame({'ID':['1','1','1','1','2','2'],'number':['1','4','1','4','2','5'],
               'date':['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
      .assign(date = lambda x: pd.to_datetime(x.date))
      .assign(mean_no_from_previous_dts = lambda x: x[x.date<??].groupby('ID').number.transform('mean'))
     )

this is what i would like to get as output

    ID  number  date    mean_no_from_previous_dts
0   1   1   2021-10-19  3.0 = mean(4+1+4)
1   1   4   2021-10-16  2.5 = mean(4+1)
2   1   1   2021-10-16  4.0 = mean(1)
3   1   4   2021-10-15  0.0 = 0 (as it's the first entry for this date and ID - this number doesnt matter, can e something else)
4   2   2   2021-10-19  5.0 = mean(5)
5   2   5   2021-10-10  0.0 = 0 (as it's the first entry for this date and ID)

so for example the first entry of the column mean_no_from_previous_dts is the mean of (4+1+4): the first 4 comes from the column number and the 2nd row because 2021-10-16 (date in the 2nd row) is smaller than 2021-10-19 (date in the 1st row). The 1 comes from the 3rd row because 2021-10-16 is smaller than 2021-10-19. The second 4 comes from the 4th row because 2021-10-15 is smaller than 2021-10-19. This is for ID = 1 the the same for ID = 2

Upvotes: 0

Views: 108

Answers (1)

jezrael
jezrael

Reputation: 863611

Here is solution with numpy broadcasting per groups:

df = (pd.DataFrame({'ID':['1','1','1','1','2','2'],'number':['1','4','1','4','2','5'],
               'date':['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
      .assign(date = lambda x: pd.to_datetime(x.date), number = lambda x: x['number'].astype(int))
      )

def f(x):
    arr = x['date'].to_numpy()
    m = arr <= arr[:, None]
    #remove rows with same values - set mask to False
    np.fill_diagonal(m, False)
    #set greater values to `NaN` and get mean without NaNs
    m = np.nanmean(np.where(m, x['number'].to_numpy(), np.nan).astype(float), axis=1)
    #assign to new column
    x['no_of_previous_dts'] = m
    return x

#last value is set to 0 per groups
df =  df.groupby('ID').apply(f).fillna({'no_of_previous_dts':0})
     
print (df)
  ID  number       date  no_of_previous_dts
0  1       1 2021-10-19                 3.0
1  1       4 2021-10-16                 2.5
2  1       1 2021-10-16                 4.0
3  1       4 2021-10-15                 0.0
4  2       2 2021-10-19                 5.0
5  2       5 2021-10-10                 0.0

Upvotes: 2

Related Questions