Reputation: 4152
I have a DataFrame with columns named age and salary. There are some NaN values too. I want to fill those values using the mean and the median.
Original DataFrame
age salary
0 20.0 NaN
1 45.0 22323.0
2 NaN 598454.0
3 32.0 NaN
4 NaN 48454.0
Fill missing age with the mean() and salary with the median() of their respective columns using apply().
I used
df['age','salary'].apply({'age':lambda row:row.fillna(row.mean()), 'salary':lambda row:row.fillna(row.median()) })
It raises a KeyError for 'age','salary', even when I pass axis=1.
Expected Output
age salary
0 20.000000 48454.0
1 45.000000 22323.0
2 32.333333 598454.0
3 32.000000 48454.0
4 32.333333 48454.0
Can someone show me how to do it properly and explain what is happening in the background?
Please tell me if there are other ways too. I am learning Pandas from scratch.
Upvotes: 0
Views: 58
Reputation: 4487
According to the documentation, the easiest way to do what you ask is to pass a dictionary as the value parameter of fillna:
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
In your case the code would be:
df.fillna(value={'age': df.age.mean(), 'salary': df.salary.median()}, inplace=True)
which, for the DataFrame in the question, gives:
age salary
0 20.000000 48454.0
1 45.000000 22323.0
2 32.333333 598454.0
3 32.000000 48454.0
4 32.333333 48454.0
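For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the sample DataFrame from the question and applies the dictionary form of fillna. The names fill_values and filled are my own and not part of the original code, and this version skips inplace=True so that the original df is left untouched.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "age": [20.0, 45.0, np.nan, 32.0, np.nan],
    "salary": [np.nan, 22323.0, 598454.0, np.nan, 48454.0],
})

# mean() and median() skip NaN by default, so they only use the known values
fill_values = {"age": df["age"].mean(), "salary": df["salary"].median()}

# Each dict key names a column; only NaN cells in that column are replaced.
# Without inplace=True this returns a new DataFrame.
filled = df.fillna(value=fill_values)
print(filled)
```

Columns that are not listed in the dictionary would be left as they are.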
Upvotes: 1
Reputation: 23799
How about computing the fill values before running apply? That is, compute the mean of age and the median of salary, then use the following (note the extra [] brackets needed to select multiple columns):
median_salary = df['salary'].median()
mean_age = df['age'].mean()
df[['age','salary']].apply({'age': lambda r: r.fillna(mean_age), 'salary': lambda r: r.fillna(median_salary)})
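On the "what is happening in the background" part: as far as I understand it, Python reads df['age','salary'] as df[('age', 'salary')], so pandas looks for a single column whose label is that tuple. No such column exists, hence the KeyError. A list of labels selects both columns instead. A small sketch with the question's data (the names raised and subset are only for illustration):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "age": [20.0, 45.0, np.nan, 32.0, np.nan],
    "salary": [np.nan, 22323.0, 598454.0, np.nan, 48454.0],
})

try:
    # Equivalent to df[("age", "salary")]: a lookup of ONE tuple-labelled column
    df["age", "salary"]
    raised = False
except KeyError:
    raised = True

# A list inside the brackets selects several columns and returns a DataFrame
subset = df[["age", "salary"]]

print(raised, list(subset.columns))
```

Passing axis=1 to apply does not help because the error is raised by the column lookup, before apply is ever called.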
Also note that this does not modify the DataFrame but creates a new one, so if you want to update the columns, use something like:
df[['age', 'salary']] = df[['age', 'salary']].apply(...)
Or, in your case where you just want to fill in missing values, the best solution is probably:
df.fillna({'age': mean_age, 'salary': median_salary}, inplace=True)
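Since you asked about other ways: if you would rather not use inplace=True, two alternatives are to fill each column separately and assign it back, or to keep the DataFrame that fillna returns. This is only a sketch on the question's data, and the names by_column and by_dict are made up for the example.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "age": [20.0, 45.0, np.nan, 32.0, np.nan],
    "salary": [np.nan, 22323.0, 598454.0, np.nan, 48454.0],
})

# Option 1: fill one column at a time on a copy and assign the result back
by_column = df.copy()
by_column["age"] = by_column["age"].fillna(by_column["age"].mean())
by_column["salary"] = by_column["salary"].fillna(by_column["salary"].median())

# Option 2: fillna without inplace returns a new, filled DataFrame
by_dict = df.fillna({"age": df["age"].mean(), "salary": df["salary"].median()})

# Both routes should produce the same values
print(by_column.equals(by_dict))
```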
Upvotes: 1