Data Imputation with Mean in Python

Question

I'm working with some data where I have hourly observations for patients. In some cases, some of the features for a specific patient are completely empty. I'm trying to find a way to impute the data by using constant average that's based off a population subset of 50 other patients who have the same gender and a similar age. I've given a simplified look at the data below:

HR	O2Sat	Temp	Platelets	Age	Gender	PatientID
80	98	36.5	NaN	52	1	A0
82	96	37.0	NaN	52	1	A0
82	100	36.3	160	53	1	A1
90	93	36.6	165	53	1	A1
83	95	35.9	140	23	0	A2
79	98	36.2	155	23	0	A2
88	92	36.6	163	60	0	A3
90	91	36.3	165	60	0	A3
81	95	37.1	NaN	20	0	A4
81	92	36.9	NaN	20	0	A4

I've reordered the dataframe by age and have this code so far

data = data.sort_values(['Age']).groupby(['PatientID','Gender']).apply(lambda x: x.fillna(x.mean()))

But I know that that's going to use all of the available data to find the mean but I'm not sure how to limit it to 50 patients of a similar age.

Rutger · Accepted Answer

I think I get what you want now. You want to fill the gaps with matching records for the right age and category. I created a simple example to debug.

df = pd.DataFrame(data=dict(
    category=['A', 'B', 'C'] * 3,
    age=[15, 15, 15, 15, 15, 15, 25, 25, 25],
    measurement=[1, 2, 3, np.nan, 5, np.nan, 7, 8, 9]))
df['age_bucket'] = pd.cut(df['age'], bins=np.arange(0, 100, 10))
agg = df.groupby(['category', 'age_bucket']).head(2).groupby(['category', 'age_bucket'])['measurement'].mean()
df = df.set_index(['category', 'age_bucket'])
df['measurement'] = df['measurement'].fillna(agg)

I updated the answer. If you convert the age to age bucket you get similar ages. With the .head() you can select only the first couple of records within a group. Unless you have an enormous data set I would suggest to just use all but that's up to you.

Data Imputation with Mean in Python

Answers (1)

Related Questions