Is there a way to model a partially reliable predictor in a classification problem with an unbalanced target?

Question

I would like to share a classification issue I ran into while modelling. I have to build a model for an unbalanced binary target from 4 predictors, one of which has wrong values for 45% of the observations. This predictor must be in the model.

*** What do I have in my data?

Number of observations: 10,000
1 target: a binary variable, call it A -> Yes (38) / No (9,962)
4 predictors: VarB (categorical) - VarC (categorical) - VarD (numeric) - VarE (categorical)
Issues: For 45% of the observations, the values of VarD are wrong. The remaining 55% were corrected manually by an external team. That correction changed the initial definition of VarD. Also, there is no way to correct the wrong 45%. On top of that, the wrong values affect 48% (18) of the Yes category of the target. Let us call the resulting remediated variable RVarD (the variable where 55% of the values have been corrected and 45% are wrong).
Constraint: Build a binary model in which the remediated RVarD must be one of the predictors, and I cannot use black-box models/tools or overly sophisticated approaches.

*** Possible solutions with their pros/cons:

Fit a model with the remediated variable (RVarD) and the other predictors after dropping the 45% of observations where RVarD is wrong. That leaves 5,500 observations - target (Yes: 20 / No: 5,480).

Pros: Easy to do
Cons: Too few observations in the Yes category of the target (20). Performance would be unstable because of that low count
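
To put numbers on this option, here is a minimal sketch in plain Python that uses only the counts quoted above. The Wilson score interval and the "events per predictor" ratio are my own choice of illustration, not something computed on the real data. The ratio also counts each predictor once, so it is optimistic: the categorical predictors would need one coefficient per extra level.

```python
import math

# Counts taken from the description above.
n_total, yes_total = 10_000, 38
yes_wrong = 18                      # Yes rows where RVarD is wrong
n_kept = 5_500                      # rows where RVarD was corrected
yes_kept = yes_total - yes_wrong    # Yes rows left after dropping
n_predictors = 4                    # VarB, VarC, RVarD, VarE


def wilson_interval(events, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(yes_kept, n_kept)
events_per_predictor = yes_kept / n_predictors

print(f"Yes rate before dropping: {yes_total / n_total:.4%}")
print(f"Yes rate after dropping:  {yes_kept / n_kept:.4%}")
print(f"Approx. 95% interval:     {low:.4%} to {high:.4%}")
print(f"Yes events per predictor: {events_per_predictor}")
```

The width of the interval relative to the rate itself is one way to see the instability mentioned in the cons.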

Find a way to impute the wrong 45% of RVarD based on the distribution of the corrected 55%. I could also discretize the variable and assign a category to the wrong 45% based on the correct 55%.

Pros: Simple and quick way to impute
Cons: As the definition changed, it looks like I would be comparing bananas and apples.
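
A minimal sketch of what I mean by this option, assuming the wrong values are flagged row by row. The function names `impute_from_corrected` and `quartile_bins` and the toy values are hypothetical, and drawing at random from the corrected values is only one possible way to "use the distribution of the 55%". It does nothing about the change of definition mentioned in the cons.

```python
import bisect
import random
import statistics


def impute_from_corrected(values, is_corrected, seed=0):
    """Replace unreliable values with random draws from the corrected ones."""
    rng = random.Random(seed)
    pool = [v for v, ok in zip(values, is_corrected) if ok]
    if not pool:
        raise ValueError("no corrected values to draw from")
    return [v if ok else rng.choice(pool) for v, ok in zip(values, is_corrected)]


def quartile_bins(values, is_corrected):
    """Bin every value using quartile cut points of the corrected values only."""
    pool = [v for v, ok in zip(values, is_corrected) if ok]
    cuts = statistics.quantiles(pool, n=4)  # three cut points -> bins 0..3
    return [bisect.bisect_right(cuts, v) for v in values]


# Made-up toy values, only to show the mechanics (not real data).
rvard = [1.2, 3.4, 99.0, 2.8, -50.0, 4.1, 0.7, 77.0]
corrected = [True, True, False, True, False, True, True, False]

imputed = impute_from_corrected(rvard, corrected)
bins = quartile_bins(imputed, corrected)

print(imputed)
print(bins)
```

Because the draws are random, the fitted coefficients would depend on the seed, so in practice I would probably have to repeat the imputation several times and compare the results.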

Fit one model without the remediated variable (RVarD) and use its coefficients for predictions (probabilities). Fit a second model with only RVarD on the 55% of observations that are correct. Compare the two sets of probabilities and find a scaling factor to link the two models.

Cons: Very complicated, and it is hard to define the scaling factor properly and link the two models

As in 2/, fit a first model without the remediated variable RVarD and use its coefficients for prediction first. Then find a way to bring in the mandatory variable RVarD through business rules or an additional layer.

Pros: A statistical model is secured
Cons: Complicated, and it is hard to define the best rule for the 45% of wrong data in the remediated variable RVarD
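
One possible shape for such an additional layer, as a rough sketch only: keep the first model's probability when RVarD is wrong, and shift it on the log-odds scale when RVarD has been corrected. The function `layered_probability` and the parameters `weight` and `reference` are hypothetical. I assume they would have to be estimated on the 5,500 corrected rows or set by a business rule, and the numbers below are invented.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Inverse of logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


def layered_probability(p_base, rvard, rvard_is_corrected, weight, reference):
    """Adjust the base model's probability with RVarD when it is reliable.

    p_base    : probability from the model fitted without RVarD
    weight    : hypothetical effect of RVarD on the log-odds scale
    reference : hypothetical RVarD value at which no adjustment is made
    """
    if not rvard_is_corrected:
        # Wrong RVarD: fall back to the base model unchanged.
        return p_base
    return inv_logit(logit(p_base) + weight * (rvard - reference))


# Invented numbers, only to show the behaviour of the layer.
p_reliable = layered_probability(0.004, rvard=3.5, rvard_is_corrected=True,
                                 weight=0.8, reference=2.0)
p_unreliable = layered_probability(0.004, rvard=99.0, rvard_is_corrected=False,
                                   weight=0.8, reference=2.0)

print(p_reliable, p_unreliable)
```

With this layout RVarD is formally part of the final score, but it only influences the rows where its value can be trusted, which is exactly the part I find hard to justify as a rule.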

Which one is more realistic, or how could I improve it? Feel free to propose a different approach; I am open to discussion.

Thanks a lot.
