sahmed
sahmed

Reputation: 69

Data Imputation with Mean in Python

I'm working with some data where I have hourly observations for patients. In some cases, some of the features for a specific patient are completely empty. I'm trying to find a way to impute the data by using constant average that's based off a population subset of 50 other patients who have the same gender and a similar age. I've given a simplified look at the data below:

HR O2Sat Temp Platelets Age Gender PatientID
80 98 36.5 NaN 52 1 A0
82 96 37.0 NaN 52 1 A0
82 100 36.3 160 53 1 A1
90 93 36.6 165 53 1 A1
83 95 35.9 140 23 0 A2
79 98 36.2 155 23 0 A2
88 92 36.6 163 60 0 A3
90 91 36.3 165 60 0 A3
81 95 37.1 NaN 20 0 A4
81 92 36.9 NaN 20 0 A4

I've reordered the dataframe by age and have this code so far

data = data.sort_values(['Age']).groupby(['PatientID','Gender']).apply(lambda x: x.fillna(x.mean()))

But I know that that's going to use all of the available data to find the mean but I'm not sure how to limit it to 50 patients of a similar age.

Upvotes: 0

Views: 362

Answers (1)

Rutger
Rutger

Reputation: 603

I think I get what you want now. You want to fill the gaps with matching records for the right age and category. I created a simple example to debug.

df = pd.DataFrame(data=dict(
    category=['A', 'B', 'C'] * 3,
    age=[15, 15, 15, 15, 15, 15, 25, 25, 25],
    measurement=[1, 2, 3, np.nan, 5, np.nan, 7, 8, 9]))
df['age_bucket'] = pd.cut(df['age'], bins=np.arange(0, 100, 10))
agg = df.groupby(['category', 'age_bucket']).head(2).groupby(['category', 'age_bucket'])['measurement'].mean()
df = df.set_index(['category', 'age_bucket'])
df['measurement'] = df['measurement'].fillna(agg)

I updated the answer. If you convert the age to age bucket you get similar ages. With the .head() you can select only the first couple of records within a group. Unless you have an enormous data set I would suggest to just use all but that's up to you.

Upvotes: 1

Related Questions