learnToCode

Reputation: 393

Predicting a feature and using the predicted feature to predict the target

I am working on a supervised ML classification use case with six features and a target variable. Call the features A, B, C, D, E, F and the target G. Feature E isn't a raw feature, meaning it is predicted from some other features. I tried using that feature for model building and the classification metrics were pretty good. But now my boss says we cannot use feature E as it isn't directly available; we need to predict it first and then use it to predict the target G.

Below are some of the things I tried:

  1. I tried building a model after removing feature E from my feature list; the metrics dropped, meaning that feature E is important.

  2. My boss says that feature E is derived from (dependent on) features A, B, C, D and F, so we can use those to predict feature E and then use features A, B, C, D, E, F to predict G.

Here are my concerns:

  1. If feature E is dependent on features A, B, C, D and F, then not using feature E while building the model should not affect my metrics much.

  2. If I use features A, B, C, D and F to predict feature E, and then use features A, B, C, D, E, F to predict G, won't I be using correlated features for model building, since E is predicted from A, B, C, D and F? Using E won't add any extra information to my feature set.

My understanding is that if removing feature E from my feature list dropped my metrics, then feature E must carry information from somewhere other than features A, B, C, D, F.

I am not experienced in ML, and these are my thoughts about the problem.

Please let me know whether my thought process is right.

Upvotes: 0

Views: 2308

Answers (1)

CoMartel

Reputation: 3591

  1. If feature E is dependent on features A, B, C, D and F, then not using feature E while building the model should not affect my metrics much.

It really depends on the model you're using, but here's a simple example: imagine you are using a linear regression model, and the value you're trying to predict is y = x².

You can't find a fitting model with a simple linear regressor (A*x + B). However, you can create a new feature x' = x², and now you can fit y = A*x' + B. So a feature that depends on a combination of the other features can sometimes help your model.
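A minimal sketch of this effect with scikit-learn, on toy data (the variable names here are just for illustration): a plain linear regressor can't fit y = x², but the same regressor with the derived feature x² fits it exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: the target is y = x^2, which a straight line cannot fit.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = (x ** 2).ravel()

# Plain linear regression on x alone: R^2 is near zero,
# because x and x^2 are uncorrelated on symmetric data.
plain = LinearRegression().fit(x, y)
print("R^2 with x only:  ", plain.score(x, y))

# Add the derived feature x' = x^2: the fit becomes exact (R^2 = 1).
x_derived = np.hstack([x, x ** 2])
derived = LinearRegression().fit(x_derived, y)
print("R^2 with x and x^2:", derived.score(x_derived, y))
```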

  1. If I use features A, B, C, D and F to predict feature E, and then use features A, B, C, D, E, F to predict G, won't I be using correlated features for model building, since E is predicted from A, B, C, D and F? Using E won't add any extra information to my feature set.

This question is trickier, because it really depends on the model you use to predict E and the model you use to predict G. If you use a simple linear regressor for both, then yes, E will be a linear combination of the other variables and won't help predict G.

But you could imagine predicting E using a non-linear model, like a RandomForest, and that could help your final model.

Bottom line: it doesn't cost much to try. Just be careful to use the same train/test split for both models, to avoid any leakage.
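A sketch of that two-stage setup, on made-up data (the data-generating formulas below are assumptions for illustration only): a RandomForest predicts E from the raw features, then a classifier predicts G from the raw features plus the predicted E. One shared split keeps both stages off the test rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data: five raw features (A, B, C, D, F); E and the
# target G are known at training time, but E must be predicted at
# inference time because it isn't directly available.
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(500, 5))
E = np.sin(X_raw[:, 0]) * X_raw[:, 1] + 0.1 * rng.normal(size=500)
G = (X_raw.sum(axis=1) + E > 0).astype(int)

# One split shared by both stages, so neither model sees test rows.
X_tr, X_te, E_tr, E_te, G_tr, G_te = train_test_split(
    X_raw, E, G, test_size=0.25, random_state=0)

# Stage 1: non-linear model for E from the raw features.
e_model = RandomForestRegressor(random_state=0).fit(X_tr, E_tr)

# Stage 2: classifier for G on raw features plus *predicted* E,
# so the feature set looks the same at train and inference time.
X_tr_full = np.hstack([X_tr, e_model.predict(X_tr)[:, None]])
X_te_full = np.hstack([X_te, e_model.predict(X_te)[:, None]])
g_model = RandomForestClassifier(random_state=0).fit(X_tr_full, G_tr)
print("test accuracy:", g_model.score(X_te_full, G_te))
```

Note that feeding stage 2 with E predicted on its own training rows is a common practical shortcut; a stricter setup would generate the training-time E predictions out-of-fold.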

Upvotes: 1
