mnm

Reputation: 305

python miceforest imputation works well as density plot, but doesn't have high accuracy

Imagine having a similar database, but a huge one with millions of rows and millions of names.

data = {'name':  ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
        'job': ['teacher', 'doctor', 'engineer','teacher', 'doctor', 'engineer'],
        'age': [27, 32, 78,27, 32, 78],
        'weight': [160, 209, 130,164, 206, 132],
        'meal_price': [8, 11, 27, 19, 7, 10],
        'date': ['6-12-2022', '6-12-2022', '6-12-2022','6-13-2022', '6-13-2022', '6-13-2022']
        }

I have added null values for the next day and need to impute them.

import pandas as pd
df = pd.DataFrame(data)
df

|   |name   |job        |age|weight |meal_price |date
|---|-------|-----------|---|-------|---|--------
|0  |Alex   |teacher    |27 |160    |8  |6-12-2022
|1  |Ben    |doctor     |32 |209    |11 |6-12-2022
|2  |Marry  |engineer   |78 |130    |27 |6-12-2022
|3  |Alex   |teacher    |27 |164    |19 |6-13-2022
|4  |Ben    |doctor     |32 |206    |7  |6-13-2022
|5  |Marry  |engineer   |78 |132    |10 |6-13-2022
|6  |Alex   |teacher    |NaN|NaN    |NaN|6-14-2022
|7  |Ben    |doctor     |NaN|NaN    |NaN|6-14-2022
|8  |Marry  |engineer   |NaN|NaN    |NaN|6-14-2022

After imputing with miceforest, the density plot looks great and suggests the missing values are imputed well. But when I check the accuracy, the imputation does not seem to respect the 'name' column, so the overall accuracy drops. Is there any way to do the imputation based on a column like 'name'? Thanks

Upvotes: 0

Views: 434

Answers (1)

Suspicious_Gardener

Reputation: 126

I don't think this is a good use case for Multiple Imputation, or any other supervised imputation algorithm. Usually, one of the goals of Multiple Imputation is to purposefully add some jitter/randomness to the imputed values within the reasonable bounds of predictive uncertainty, so that the user has multiple imputed sets to work with. However, the imputed values in this dataset follow a specific formula, with a time element involved to boot.

Would it be easier for you to simply match age / weight based on a lookup? Are the name and job columns always unique identifiers for a person? Do you have any unique identifiers for a person in the database? These are all things to think about.
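One rough way to probe the uniqueness question is to count how many distinct ages show up for each name-job pair. This is only a sketch using pandas on the sample rows from the question; it assumes that a stable attribute such as age should be constant for one person over the observed dates, which may not hold across a birthday.

```python
import pandas as pd

# Sample rows from the question (only the columns needed for the check).
df = pd.DataFrame({
    "name": ["Alex", "Ben", "Marry", "Alex", "Ben", "Marry"],
    "job": ["teacher", "doctor", "engineer", "teacher", "doctor", "engineer"],
    "age": [27, 32, 78, 27, 32, 78],
})

# Number of distinct non-null ages observed for each name-job pair.
ages_per_pair = df.groupby(["name", "job"])["age"].nunique()

# Pairs with more than one distinct age probably cover several people.
ambiguous_pairs = ages_per_pair[ages_per_pair > 1]
print(ambiguous_pairs)
```

If `ambiguous_pairs` comes back non-empty on the real data, name and job alone are probably not a safe key for a lookup.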

To answer your question - I believe the way to get the most accurate imputations in this case would be to follow these steps:

  1. Set the variable schema so that values are only imputed using name and job.
  2. Grow fewer trees (maybe even just 1 tree).
  3. Set the mean matching candidates to 0, so the trees grown above assign the node value as the imputed value.
  4. Set the cat_smooth parameter to 0.0.
  5. Set min_data_in_leaf (LightGBM's minimum number of samples per leaf) to 1.

These steps will essentially create a lookup on distinct name-job pairs, and assign each imputed value the average age / weight / meal price of its name-job pair.

To set the variable_schema parameter so that age, weight, and meal_price are imputed using only name and job as predictor variables, the following would work in this case:

variable_schema = {
    "age": ["name", "job"],
    "weight": ["name", "job"],
    "meal_price": ["name", "job"]
}

Keep in mind that this won't work well if name and job aren't unique identifiers (which I have a feeling they aren't). This also won't work well if there are many different categories, which there probably are.

You could also just group by name-job pairs and assign values based on the average age / weight / meal_price of that name-job pair. This would probably be much easier to follow.
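The group-by approach could look roughly like the following. It is a minimal sketch with pandas, rebuilt from the table in the question (including the three NaN rows for 6-14-2022); the names `value_cols`, `group_means`, and `imputed` are only illustrative, and it assumes name and job together identify a person.

```python
import numpy as np
import pandas as pd

# Data from the question, with the 6-14-2022 rows left missing.
df = pd.DataFrame({
    "name": ["Alex", "Ben", "Marry"] * 3,
    "job": ["teacher", "doctor", "engineer"] * 3,
    "age": [27, 32, 78, 27, 32, 78, np.nan, np.nan, np.nan],
    "weight": [160, 209, 130, 164, 206, 132, np.nan, np.nan, np.nan],
    "meal_price": [8, 11, 27, 19, 7, 10, np.nan, np.nan, np.nan],
    "date": ["6-12-2022"] * 3 + ["6-13-2022"] * 3 + ["6-14-2022"] * 3,
})

value_cols = ["age", "weight", "meal_price"]

# Mean of each column within each name-job pair, aligned to the original rows.
# NaN values are skipped when the mean is computed.
group_means = df.groupby(["name", "job"])[value_cols].transform("mean")

# Fill only the missing cells; observed values are left untouched.
imputed = df.copy()
imputed[value_cols] = df[value_cols].fillna(group_means)
print(imputed)
```

A pair that has no observed values at all would still be NaN after this step, so those rows would need a separate fallback.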

Upvotes: 2
