Reputation: 724
I have a dataframe with ID
, date
and number
columns and would like to create a new column that takes the mean of all numbers for this specific ID
BUT only includes the numbers in the mean where date is smaller than the date of this row. How would I do this?
df = (pd.DataFrame({'ID':['1','1','1','1','2','2'],'number':['1','4','1','4','2','5'],
'date':['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
.assign(date = lambda x: pd.to_datetime(x.date))
.assign(mean_no_from_previous_dts = lambda x: x[x.date<??].groupby('ID').number.transform('mean'))
)
this is what i would like to get as output
ID number date mean_no_from_previous_dts
0 1 1 2021-10-19 3.0 = mean(4+1+4)
1 1 4 2021-10-16 2.5 = mean(4+1)
2 1 1 2021-10-16 4.0 = mean(1)
3 1 4 2021-10-15 0.0 = 0 (as it's the first entry for this date and ID - this number doesnt matter, can e something else)
4 2 2 2021-10-19 5.0 = mean(5)
5 2 5 2021-10-10 0.0 = 0 (as it's the first entry for this date and ID)
so for example the first entry of the column mean_no_from_previous_dts
is the mean of (4+1+4)
: the first 4
comes from the column number
and the 2nd row because 2021-10-16 (date in the 2nd row) is smaller than 2021-10-19 (date in the 1st row). The 1
comes from the 3rd row because 2021-10-16 is smaller than 2021-10-19. The second 4
comes from the 4th row because 2021-10-15 is smaller than 2021-10-19. This is for ID = 1
the the same for ID = 2
Upvotes: 0
Views: 108
Reputation: 863611
Here is solution with numpy broadcasting per groups:
df = (pd.DataFrame({'ID':['1','1','1','1','2','2'],'number':['1','4','1','4','2','5'],
'date':['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
.assign(date = lambda x: pd.to_datetime(x.date), number = lambda x: x['number'].astype(int))
)
def f(x):
arr = x['date'].to_numpy()
m = arr <= arr[:, None]
#remove rows with same values - set mask to False
np.fill_diagonal(m, False)
#set greater values to `NaN` and get mean without NaNs
m = np.nanmean(np.where(m, x['number'].to_numpy(), np.nan).astype(float), axis=1)
#assign to new column
x['no_of_previous_dts'] = m
return x
#last value is set to 0 per groups
df = df.groupby('ID').apply(f).fillna({'no_of_previous_dts':0})
print (df)
ID number date no_of_previous_dts
0 1 1 2021-10-19 3.0
1 1 4 2021-10-16 2.5
2 1 1 2021-10-16 4.0
3 1 4 2021-10-15 0.0
4 2 2 2021-10-19 5.0
5 2 5 2021-10-10 0.0
Upvotes: 2