Reputation: 143
bear with me here, i'm new to coding and stack overflow - so i may not be able to describe this succinctly.
i have 2 dataframes, df1 which is being read from a CSV, and has a column (named 'total_income' with NaN values that i want to fill in with a mean based on categorical values in another column (named 'income type').
i created a second data frame with just those two columns, to do a groupby on the 'income_type' column and then found the mean of the 'total_income' column for each unique value in the 'income type' column
i used this:
#groupby types of employment, calculate/fill in the mean
df2 = df1[['income_type','total_income']]
df2 = df2.sort_values('income_type').groupby('income_type').mean().reset_index()
my output is exactly what i am looking for:
income_type total_income
0 business 32386.793835
1 civil servant 27343.729582
2 employee 25820.841683
3 entrepreneur 79866.103000
4 paternity / maternity leave 8612.661000
5 retiree 21940.394503
6 student 15712.260000
7 unemployed 21014.360500
however, now that i have these outputs, i am confused as to how to apply them back to df1 in the same associated columns - so, for all rows that have 'employee' in df1's 'income_type' i want to apply the value 25820.84 (index #2 from df2: 2 employee 25820.841683
I know i can do this one by one by setting variables for each income type,total_income pair, but if i could do it in a loop or function that would be much more tidy.
here is the first 10 columns of df1:
children days_employed dob education edu_id family_status family_sts_id gender income_type debt total_income
1 -8437.673028 42 bachelor's degree 0 married 0 F employee 0 40620.102
1 -4024.803754 36 secondary education 1 married 0 F employee 0 17932.802
0 -5623.422610 33 Secondary Education 1 married 0 M employee 0 23341.752
3 -4124.747207 32 secondary education 1 married 0 M employee 0 42820.568
0 340266.072047 53 secondary education 1 civil partnership 1 F retiree 0 25378.572
0 -926.185831 27 bachelor's degree 0 civil partnership 1 M business 0 40922.170
0 -2879.202052 43 bachelor's degree 0 married 0 F business 0 38484.156
0 -152.779569 50 SECONDARY EDUCATION 1 married 0 M employee 0 21731.829
2 -6929.865299 35 BACHELOR'S DEGREE 0 civil partnership 1 F employee 0 15337.093
0 -2188.756445 41 secondary education 1 married 0 M employee 0 23108.150
I asked another question the other day that i tried to figure out how to apply to this:Link Here
But i am still struggling with how to take it from one df to the other in a loop with a replace. any assistance would be greatly appreciated!
Upvotes: 0
Views: 117
Reputation: 4929
You can take a way simpler approach like:
df['total_income'] = df.groupby('income_type')['total_income'].transform(lambda x: x.fillna(x.mean()))
That said, if you really want to create a second dataframe as you've done, you can leave it as pd.Series and use it to update the original df1
like this:
df2 = df1[['income_type','total_income']].copy()
income_mapper = df2.sort_values('income_type').groupby('income_type')['total_income'].mean()
df1.loc[df1.total_income.isna(), 'total_income'] = df1.loc[df1.total_income.isna(), 'income_type'].map(income_mapper)
Upvotes: 1