Reputation: 165
This is a causal inference question, specifically about how to handle an unbalanced covariate. I fitted an XGBoost model to estimate propensity scores for users (XGBoost had higher accuracy, precision, and AUC than logistic regression). When I compute the standardized mean differences (SMDs) of the balancing variables between the control and treatment groups, one feature (user age, which ranks high in feature gain/importance) is above the SMD threshold of 0.1. Some things I have tried to remediate this:
I'm stuck! How can I get the SMD for user age below 0.1? I'm unsure how to move forward with this confounding variable, since it is a highly important feature. Any help would be greatly appreciated.
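For reference, the balance check described above can be sketched as follows. This is a minimal illustration that assumes the common pooled-standard-deviation definition of the SMD for a continuous covariate; the function name and the sample ages are hypothetical, not taken from the actual data.

```python
import numpy as np

def standardized_mean_difference(treated, control):
    """Absolute SMD for one continuous covariate, using the pooled SD
    sqrt((var_t + var_c) / 2). Assumed definition; other variants exist."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    diff = abs(treated.mean() - control.mean())
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    if pooled_sd == 0:
        # Both groups are constant: balanced only if the means coincide.
        return 0.0 if diff == 0 else float("inf")
    return diff / pooled_sd

# Hypothetical ages, for illustration only.
age_treated = [34, 41, 29, 50, 38]
age_control = [31, 45, 27, 52, 36]
smd_age = standardized_mean_difference(age_treated, age_control)
print(f"SMD(age) = {smd_age:.3f}  (threshold: 0.1)")
```

With weighting instead of matching, the group means and variances would need to be replaced by their weighted counterparts.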
Upvotes: 1
Views: 253
Reputation: 1
Divide your data into strata based on age, such as creating age bands (e.g., 20-30, 31-40, etc.). Within each stratum, calculate the propensity scores and perform the matching. This can help ensure that within each stratum, the distribution of age is similar across treatment and control groups, thus reducing the overall SMD for age.
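A rough sketch of the age-band idea, using only NumPy. It makes several assumptions that are not in the post: the data are synthetic, the band edges (30, 40, 50) and the caliper of 0.05 are arbitrary, the matching is greedy 1:1 nearest-neighbour without replacement, and the propensity score is taken as precomputed rather than refitted inside each stratum. The helper names `match_within_age_bands` and `smd` are hypothetical.

```python
import numpy as np

def smd(x_t, x_c):
    """Absolute standardized mean difference with pooled SD (assumed definition)."""
    pooled_sd = np.sqrt((np.var(x_t, ddof=1) + np.var(x_c, ddof=1)) / 2.0)
    return abs(np.mean(x_t) - np.mean(x_c)) / pooled_sd

def match_within_age_bands(age, treated, pscore, band_edges, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on the propensity score,
    allowing only pairs that fall in the same age band.
    Returns a list of (treated_index, control_index) pairs."""
    age = np.asarray(age, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    pscore = np.asarray(pscore, dtype=float)
    band = np.digitize(age, band_edges)  # integer band id per user
    pairs = []
    for b in np.unique(band):
        t_idx = np.flatnonzero((band == b) & treated)
        c_idx = list(np.flatnonzero((band == b) & ~treated))
        for t in t_idx:
            if not c_idx:
                break  # no unmatched controls left in this band
            dists = np.abs(pscore[c_idx] - pscore[t])
            j = int(np.argmin(dists))
            if dists[j] <= caliper:
                # Each control is used at most once (popped after matching).
                pairs.append((int(t), int(c_idx.pop(j))))
    return pairs

# Synthetic data for illustration: older users are more likely to be treated.
rng = np.random.default_rng(0)
n = 2000
age = rng.integers(20, 60, size=n).astype(float)
p_true = 1.0 / (1.0 + np.exp(-(age - 40.0) / 8.0))
treated = rng.random(n) < p_true
pscore = p_true  # stand-in for scores from a fitted model such as XGBoost

band_edges = [30, 40, 50]
pairs = match_within_age_bands(age, treated, pscore, band_edges)
t_ids = [t for t, _ in pairs]
c_ids = [c for _, c in pairs]

smd_before = smd(age[treated], age[~treated])
smd_after = smd(age[t_ids], age[c_ids])
print(f"matched pairs: {len(pairs)}")
print(f"SMD(age) before: {smd_before:.3f}, after: {smd_after:.3f}")
```

Narrower bands tighten the age balance but leave fewer candidate controls per stratum, so more treated users may go unmatched; it is worth checking the SMDs of the other covariates again after any change to the bands.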
If that doesn't work, I think there are a couple other approaches you could take.
Upvotes: 0