Reputation: 11
I'm trying to use fillna() and transform() to impute some missing values in a column with respect to the 'release_year' and 'brand_name' of the phone, but after running my code I still have the same missing value counts.
Here are my missing value counts & percentages prior to running the code:
Here is the code I ran to impute 'main_camera_mp' and the result (just an FYI that I copied the above dataframe into df2):
df2['main_camera_mp'] = df2['main_camera_mp'].fillna(value = df2.groupby(['release_year','brand_name'])['main_camera_mp'].transform('mean'))
Upvotes: 0
Views: 100
Reputation: 10545
I guess your imputation method is not suited to your data: wherever main_camera_mp is missing, it is missing for all entries in that release_year-brand_name group. The series derived from the groupby object that you pass as the fill value will therefore itself have missing values for those groups.
Here is a simple example of how this can happen:
import numpy as np
import pandas as pd
df2 = pd.DataFrame({'main_camera_mp': [1, 2, 3, np.nan, 5, 6, np.nan, np.nan],
                    'release_year': [2000, 2000, 2001, 2001, 2000, 2000, 2001, 2001],
                    'brand_name': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']})
df2['main_camera_mp'] = df2['main_camera_mp'].fillna(
    value=df2.groupby(['release_year', 'brand_name'])['main_camera_mp'].transform('mean'))
df2
   main_camera_mp  release_year brand_name
0             1.0          2000          a
1             2.0          2000          b
2             3.0          2001          a
3             NaN          2001          b
4             5.0          2000          a
5             6.0          2000          b
6             3.0          2001          a
7             NaN          2001          b
Note that the value at index 6 was imputed as intended, but the other two missing values were not, because there is no non-missing value for their group.
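If that is what is happening in your data, one possible way forward is to first list the groups that have no non-missing values at all, and then fall back to a coarser mean for them. The sketch below reuses the toy frame from above. The fallback order (year-brand mean, then brand mean, then overall mean) and the names keys, empty_groups and filled are my own assumptions for illustration, and whether a coarser mean is a sensible fill value depends on your data.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'main_camera_mp': [1, 2, 3, np.nan, 5, 6, np.nan, np.nan],
    'release_year': [2000, 2000, 2001, 2001, 2000, 2000, 2001, 2001],
    'brand_name': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
})

keys = ['release_year', 'brand_name']

# count() ignores NaN, so a count of 0 marks a group that cannot supply a mean.
non_missing = df2.groupby(keys)['main_camera_mp'].count()
empty_groups = non_missing[non_missing == 0].index.tolist()
print(empty_groups)

# Assumed fallback chain: year-brand mean, then brand mean, then overall mean.
col = df2['main_camera_mp']
filled = (
    col
    .fillna(df2.groupby(keys)['main_camera_mp'].transform('mean'))
    .fillna(df2.groupby('brand_name')['main_camera_mp'].transform('mean'))
    .fillna(col.mean())
)
print(filled.isna().sum())
```

On the toy frame, the only group reported as empty is (2001, 'b'), and its two rows are filled from the brand-level mean of 'b' instead.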
Upvotes: 1