Reputation: 69
I'm working with some data where I have hourly observations for patients. In some cases, some of the features for a specific patient are completely empty. I'm trying to find a way to impute the data by using constant average that's based off a population subset of 50 other patients who have the same gender and a similar age. I've given a simplified look at the data below:
HR | O2Sat | Temp | Platelets | Age | Gender | PatientID |
---|---|---|---|---|---|---|
80 | 98 | 36.5 | NaN | 52 | 1 | A0 |
82 | 96 | 37.0 | NaN | 52 | 1 | A0 |
82 | 100 | 36.3 | 160 | 53 | 1 | A1 |
90 | 93 | 36.6 | 165 | 53 | 1 | A1 |
83 | 95 | 35.9 | 140 | 23 | 0 | A2 |
79 | 98 | 36.2 | 155 | 23 | 0 | A2 |
88 | 92 | 36.6 | 163 | 60 | 0 | A3 |
90 | 91 | 36.3 | 165 | 60 | 0 | A3 |
81 | 95 | 37.1 | NaN | 20 | 0 | A4 |
81 | 92 | 36.9 | NaN | 20 | 0 | A4 |
I've reordered the dataframe by age and have this code so far
data = data.sort_values(['Age']).groupby(['PatientID','Gender']).apply(lambda x: x.fillna(x.mean()))
But I know that that's going to use all of the available data to find the mean but I'm not sure how to limit it to 50 patients of a similar age.
Upvotes: 0
Views: 362
Reputation: 603
I think I get what you want now. You want to fill the gaps with matching records for the right age and category. I created a simple example to debug.
df = pd.DataFrame(data=dict(
category=['A', 'B', 'C'] * 3,
age=[15, 15, 15, 15, 15, 15, 25, 25, 25],
measurement=[1, 2, 3, np.nan, 5, np.nan, 7, 8, 9]))
df['age_bucket'] = pd.cut(df['age'], bins=np.arange(0, 100, 10))
agg = df.groupby(['category', 'age_bucket']).head(2).groupby(['category', 'age_bucket'])['measurement'].mean()
df = df.set_index(['category', 'age_bucket'])
df['measurement'] = df['measurement'].fillna(agg)
I updated the answer. If you convert the age to age bucket you get similar ages. With the .head() you can select only the first couple of records within a group. Unless you have an enormous data set I would suggest to just use all but that's up to you.
Upvotes: 1