Reputation: 305
Imagine having a similar database, but a huge one with millions of rows and millions of names.
data = {'name': ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
'job': ['teacher', 'doctor', 'engineer','teacher', 'doctor', 'engineer'],
'age': [27, 32, 78,27, 32, 78],
'weight': [160, 209, 130,164, 206, 132],
'meal_price': [8, 11, 27, 19, 7, 10],
'date': ['6-12-2022', '6-12-2022', '6-12-2022','6-13-2022', '6-13-2022', '6-13-2022']
}
I have added null values for the next day, and need to do imputation
df = pd.DataFrame(data) df
|name |job |age|weight |meal_price |date
|---|-------|-----------|---|-------|---|--------
|0 |Alex |teacher |27 |160 |8 |6-12-2022
|1 |Ben |doctor |32 |209 |11 |6-12-2022
|2 |Marry |engineer |78 |130 |27 |6-12-2022
|3 |Alex |teacher |27 |164 |19 |6-13-2022
|4 |Ben |doctor |32 |206 |7 |6-13-2022
|5 |Marry |engineer |78 |132 |10 |6-13-2022
|6 |Alex |teacher |NaN|NaN |NaN|6-14-2022
|7 |Ben |doctor |NaN|NaN |NaN|6-14-2022
|8 |Marry |engineer |NaN|NaN |NaN|6-14-2022
After doing imputation with miceforest, the density plot is great, and seems like it can impute the missing value well, but when I check the accuracy, it seems that it is not doing a good job imputing the data based on column 'name', so the general accuracy decreases. is there any way I can do the imputation based on a column like 'name'? Thanks
Upvotes: 0
Views: 434
Reputation: 126
I don't think this is a good use case for Multiple Imputation, or any other supervised imputation algorithm. Usually, one of the goals of Multiple Imputation is to purposefully add some jitter/randomness to the imputed values within the reasonable bounds of predictive uncertainty, so that the user has multiple imputed sets to work with. However, the imputed values in this dataset follow a specific formula, with a time element involved to boot.
Would it be easier for you to simply match age / weight based on a lookup? Are the name
and job
columns always unique identifiers for a person? Do you have any unique identifiers for a person in the database? These are all things to think about.
To answer your question - I believe the way to get the most accurate imputations in this case would be to follow these steps:
name
and job
These steps will essentially create a lookup on distinct name-job pairs, and assign imputation values to the average age / weight / meal price of that name-job pair.
To set the variable_schema
parameter so that age
, weight
, and mean_price
are only imputed using name
and job
as predictor variables. The following would work in this case:
variable_schema = {
"age": ["name", "job"],
"weight": ["name", "job"],
"meal_price": ["name", "job"]
}
Keep in mind that this won't work well if name and job aren't unique identifiers (which I have a feeling they aren't). This also won't work well if there are many different categories, which there probably are.
You could also just group by name-job pairs and assign values based on the average age / weight / meal_price of that name-job pair. This would probably be much easier to follow.
Upvotes: 2