seuadr
seuadr

Reputation: 143

python - applying fillna data from 1 dataframe to null values in another

bear with me here, i'm new to coding and stack overflow - so i may not be able to describe this succinctly.

i have 2 dataframes, df1 which is being read from a CSV, and has a column (named 'total_income' with NaN values that i want to fill in with a mean based on categorical values in another column (named 'income type').

i created a second data frame with just those two columns, to do a groupby on the 'income_type' column and then found the mean of the 'total_income' column for each unique value in the 'income type' column

i used this:

#groupby types of employment, calculate/fill in the mean
df2 = df1[['income_type','total_income']]
df2 = df2.sort_values('income_type').groupby('income_type').mean().reset_index()

my output is exactly what i am looking for:

                       income_type  total_income
0                     business  32386.793835
1                civil servant  27343.729582
2                     employee  25820.841683
3                 entrepreneur  79866.103000
4  paternity / maternity leave   8612.661000
5                      retiree  21940.394503
6                      student  15712.260000
7                   unemployed  21014.360500

however, now that i have these outputs, i am confused as to how to apply them back to df1 in the same associated columns - so, for all rows that have 'employee' in df1's 'income_type' i want to apply the value 25820.84 (index #2 from df2: 2 employee 25820.841683

I know i can do this one by one by setting variables for each income type,total_income pair, but if i could do it in a loop or function that would be much more tidy.

here is the first 10 columns of df1:

children days_employed dob   education      edu_id family_status family_sts_id gender income_type  debt  total_income
   1     -8437.673028   42  bachelor's degree      0       married          0      F    employee     0     40620.102   
   1     -4024.803754   36  secondary education    1       married          0      F    employee     0     17932.802   
   0     -5623.422610   33  Secondary Education    1       married          0      M    employee     0     23341.752   
   3     -4124.747207   32  secondary education    1       married          0      M    employee     0     42820.568   
   0      340266.072047 53  secondary education    1       civil partnership    1      F     retiree     0     25378.572   
   0     -926.185831    27  bachelor's degree      0       civil partnership    1      M    business     0     40922.170   
   0     -2879.202052   43  bachelor's degree      0       married          0      F    business     0     38484.156   
   0     -152.779569    50  SECONDARY EDUCATION    1       married          0      M    employee     0     21731.829   
   2     -6929.865299   35  BACHELOR'S DEGREE      0       civil partnership    1      F    employee     0     15337.093   
   0     -2188.756445   41  secondary education    1       married          0      M    employee     0     23108.150 

I asked another question the other day that i tried to figure out how to apply to this:Link Here

But i am still struggling with how to take it from one df to the other in a loop with a replace. any assistance would be greatly appreciated!

Upvotes: 0

Views: 117

Answers (1)

You can take a way simpler approach like:

df['total_income'] = df.groupby('income_type')['total_income'].transform(lambda x: x.fillna(x.mean()))

That said, if you really want to create a second dataframe as you've done, you can leave it as pd.Series and use it to update the original df1 like this:

df2 = df1[['income_type','total_income']].copy()
income_mapper = df2.sort_values('income_type').groupby('income_type')['total_income'].mean()
df1.loc[df1.total_income.isna(), 'total_income'] = df1.loc[df1.total_income.isna(), 'income_type'].map(income_mapper)

Upvotes: 1

Related Questions