Antoine F
Antoine F

Reputation: 191

How to fill missing values relative to a value from another column

I'd like to fill missing values with conditions relative to the country: For example, I'd want to replace China's missing values with the mean of Age and for USA it's the median of Age. For now, I don't want to touch of EU's missing values. How could I do realise it ? Below the dataframe

import pandas as pd
data = [['USA', ], ['EU', 15], ['China', 35],
       ['USA', 45], ['EU', 30], ['China', ],
       ['USA', 28], ['EU', 26], ['China', 78],
       ['USA', 65], ['EU', 53], ['China', 66],
       ['USA', 32], ['EU', ], ['China', 14]]  

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Country', 'Age'])
df.head(10)

  Country   Age
0     USA   NaN
1      EU  15.0
2   China  35.0
3     USA  45.0
4      EU  30.0
5   China   NaN
6     USA  28.0
7      EU  26.0
8   China  78.0
9     USA  65.0
10    EU    NaN

Thank you

Upvotes: 1

Views: 821

Answers (3)

BICube
BICube

Reputation: 4681

Not sure if this is the best way to do it but it is one way to do it

age_series = df['Age'].copy()
df.loc[(df['Country'] == 'China') & (df['Age'].isnull()), 'Age'] = age_series.mean()
df.loc[(df['Country'] == 'USA') & (df['Age'].isnull()), 'Age'] = age_series.median()

Note that I copied the age column before hand so that you get the median of the original age series not after calculating the mean for the US. This is the final results

    Country     Age
0   USA     33.500000
1   EU      15.000000
2   China   35.000000
3   USA     45.000000
4   EU      30.000000
5   China   40.583333
6   USA     28.000000
7   EU      26.000000
8   China   78.000000
9   USA     65.000000
10  EU      53.000000
11  China   66.000000
12  USA     32.000000
13  EU      NaN
14  China   14.000000

Upvotes: 2

moys
moys

Reputation: 8033

May be you can try this

df['Age']=(np.where((df['Country'] == 'China') & (df['Age'].isnull()),df['Age'].mean()
          ,np.where((df['Country'] == 'USA') & (df['Age'].isnull()),df['Age'].median(),df['Age']))).round()

Output

   Country  Age
0   USA     34.0
1   EU      15.0
2   China   35.0
3   USA     45.0
4   EU      30.0
5   China   41.0
6   USA     28.0
7   EU      26.0
8   China   78.0
9   USA     65.0
10  EU      53.0
11  China   66.0
12  USA     32.0
13  EU      NaN
14  China   14.0

Upvotes: 1

Umar.H
Umar.H

Reputation: 23099

IIUC, we can create a function to handle this as it's not easily automated (although I may be wrong)

the idea is to pass in the country name & fill type (i.e mean median) you can extend the function to add in your agg types.

it returns a data frame that modifies yours, so you can use this to assign it back to your col

def missing_values(dataframe,country,fill_type):

    """
    takes 3 arguments, dataframe, country & fill_type:
    fill_type is the method used to fill `NA` values, mean, median, etc.
    """

    fill_dict = dataframe.loc[dataframe['Country'] == country]\
            .groupby("Country")["Age"].agg(
                    ["mean", "median"]).to_dict(orient='index')



    dataframe.loc[dataframe['Country'] == country, 'Age'] \
                 = dataframe['Age'].fillna(fill_dict[country][fill_type])
    return dataframe

print(missing_values(df,'China','mean')
    Country    Age
0      USA    NaN
1       EU  15.00
2    China  35.00
3      USA  45.00
4       EU  30.00
5    China  48.25
6      USA  28.00
7       EU  26.00
8    China  78.00
9      USA  65.00
10      EU  53.00
11   China  66.00
12     USA  32.00
13      EU    NaN
14   China  14.00

print(missing_values(df,'USA','median'))
   Country    Age
0      USA  38.50
1       EU  15.00
2    China  35.00
3      USA  45.00
4       EU  30.00
5    China  48.25
6      USA  28.00
7       EU  26.00
8    China  78.00
9      USA  65.00
10      EU  53.00
11   China  66.00
12     USA  32.00
13      EU    NaN
14   China  14.00

Upvotes: 0

Related Questions