Reputation: 141
I am building an XGBoost model with hundreds of features. For features that are highly correlated with each other (Pearson correlation), I am thinking of using feature importance (measured by gain) to drop the ones with low importance. My questions: 1. Does correlation affect or bias feature importance (measured by gain)? 2. Is there a good way to remove highly correlated features for ML models?
Example: a's importance = 120, b's importance = 14, corr(a, b) = 0.8. I am thinking of dropping b because its importance is only 14. Is that correct?
Thank you.
Upvotes: 0
Views: 1761
Reputation: 2670
Correlation definitely impacts feature importance. When two features are highly correlated, a tree can split on either one and get a similar result, so the gain is shared between them in a way that depends on which one happens to be picked. It also means there is a high level of redundancy if you keep them all: when two features are correlated, a change in one tends to go along with a change in the other, so each carries much of the same information. There is usually no need to keep all of them, since a few of them will often be enough to classify your data well.
So in order to remove highly correlated features you can:
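One option, sketched here as an illustration rather than a definitive recipe, is to rank features by gain and keep a feature only if it is not highly correlated with one that has already been kept. The function name `drop_correlated`, the 0.8 threshold, the synthetic data, and the gain value for `c` are all assumptions made for this example; the values for `a` and `b` reuse the numbers from the question.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, importance, threshold=0.8):
    """Greedily keep features, most important first, skipping any feature
    whose absolute Pearson correlation with an already kept feature
    exceeds `threshold`. Returns kept names in original column order."""
    corr = df.corr(method="pearson").abs()
    # Features missing from `importance` are treated as importance 0.
    ranked = sorted(df.columns, key=lambda c: importance.get(c, 0.0), reverse=True)
    kept = []
    for col in ranked:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return [c for c in df.columns if c in kept]


# Synthetic data (assumption): b is a noisy copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})

# Illustrative gain values; in practice these would come from the trained model.
gain = {"a": 120.0, "b": 14.0, "c": 30.0}

selected = drop_correlated(df, gain, threshold=0.8)
print(selected)
```

Here `b` is dropped because it is almost a copy of `a` and has the lower gain. Because gain is shared between correlated features, a low gain for `b` does not by itself show that `b` is uninformative, so it is worth retraining on the reduced feature set and checking the validation score before committing to the drop.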
Upvotes: 1