ATHARVA HIWASE

Reputation: 59

Fill nan values in one column based on other columns

I am working on a dataset of average age of marriage and am currently cleaning the data. The location column contains 'NaN' values that I need to fill, but the column has many unique values, so it is not obvious what to fill them with. I would appreciate suggestions on how to fill NaN values in a column that has many unique values.


I have attached the dataset for reference, DataSet

Upvotes: 1

Views: 567

Answers (1)

Andrey Lukyanenko

Reputation: 3851

I suggest doing it in 3 steps:

  1. Fill in the missing values of location with either the most common location or with a separate value "Unknown";
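  2. Fill in the missing values of "age_of_marriage" with the median of this feature within each location (the per-location mean works as well);
  3. If there are any missing values of "age_of_marriage" left (for example, when a location has no known ages at all), fill them in with the overall mean of the column.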
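import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/atharva07/Age-of-marriage/main/age_of_marriage_data.csv', sep=',')
# Step 1: label missing locations explicitly
df['location'] = df['location'].fillna('Unknown')
# Step 2: fill within each location; transform keeps the original row index, so the assignment aligns
df['age_of_marriage'] = df.groupby('location')['age_of_marriage'].transform(lambda x: x.fillna(x.median()))
# Step 3: fill whatever is still missing with the overall mean
df['age_of_marriage'] = df['age_of_marriage'].fillna(df['age_of_marriage'].mean())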
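
To see what each step does, the same three steps can be tried on a small made-up DataFrame. This is only a sketch: the toy rows, the helper name `fill_missing` and its `use_mode` flag are assumptions for illustration and do not come from the real dataset. It also covers the "most common location" option from step 1, and it uses `groupby(...).transform` rather than `apply` so that the filled values keep the original row index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names match the real dataset.
toy = pd.DataFrame({
    "location": ["Delhi", "Delhi", None, "Mumbai", "Mumbai", "Delhi"],
    "age_of_marriage": [28.0, np.nan, 30.0, np.nan, np.nan, 32.0],
})


def fill_missing(frame: pd.DataFrame, use_mode: bool = False) -> pd.DataFrame:
    """Apply the three steps to a copy of the frame (illustrative helper)."""
    out = frame.copy()

    # Step 1: fill missing locations with the most common location
    # or with an explicit "Unknown" label.
    if use_mode:
        fill_value = out["location"].mode().iloc[0]
    else:
        fill_value = "Unknown"
    out["location"] = out["location"].fillna(fill_value)

    # Step 2: fill missing ages with the median age of the same location.
    # Groups with no known ages have a NaN median and stay missing here.
    out["age_of_marriage"] = (
        out.groupby("location")["age_of_marriage"]
        .transform(lambda s: s.fillna(s.median()))
    )

    # Step 3: fill anything still missing with the overall mean.
    out["age_of_marriage"] = out["age_of_marriage"].fillna(
        out["age_of_marriage"].mean()
    )
    return out


filled = fill_missing(toy)                        # location NaN -> "Unknown"
filled_mode = fill_missing(toy, use_mode=True)    # location NaN -> most common value
```

In this toy data both "Mumbai" rows have no known age, so step 2 leaves them missing and only step 3 fills them. Whether "Unknown" or the most common location is the better choice depends on how the location column is used later; "Unknown" avoids inflating the count of the most frequent location.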

Upvotes: 3
