Reputation: 293
I have a dataset in this form:
Customer_key purcahse_amount Date
12633 4435 08/07/2021
34243 7344 11/11/2021
54355 4642 10/11/2020
12633 6322 11/12/2021
Purchase_amount has few Nan values.... I want to replace the Nan values with mean of purcahse_amount
but mean should be calculated for that particlar customer_key only. For example, If you see customer_key=12633
is repeating in row 1 and row 4. So, for any row with customer_key=12633
, if there is a missing value of purcahse_amount
, then it be replaced with mean of that purchase_amount
of all rows where customer_key=12633
.
Upvotes: 0
Views: 26
Reputation:
You can groupby
"Customer_key" and then compute the mean of "purchase_amount" and transform it for the DataFrame (we need to transform it to use it in np.where
where the values to choose from must be broadcastable). Note that mean
method skips NaN values by default, so it transforms the mean of non-NaN values.
Then using np.where
, depending on if a "purchase_amount" is NaN or not fill in with group-specific mean or keep original.
means = df.groupby('Customer_key')['purchase_amount'].transform('mean')
df['purchase_amount'] = np.where(df['purchase_amount'].isna(), means, df['purchase_amount'])
or you can use fillna
:
df['purchase_amount'] = df['purchase_amount'].fillna(means)
For example, if you had df
as below:
Customer_key purchase_amount Date
0 12633 4435.0 08/07/2021
1 34243 7344.0 11/11/2021
2 54355 4642.0 10/11/2020
3 12633 6322.0 11/12/2021
3 12633 NaN 11/12/2021
the both of the above options produce:
Customer_key purchase_amount Date
0 12633 4435.0 08/07/2021
1 34243 7344.0 11/11/2021
2 54355 4642.0 10/11/2020
3 12633 6322.0 11/12/2021
3 12633 5378.5 11/12/2021
Upvotes: 1