Reputation: 341
I'm trying to solve an ML problem where the target variable is numeric, say the pollution level in a city. The client is not interested in predicting the actual amount of pollutants; they only want to know whether the pollution level is high or low based on an agreed-upon threshold (High if the PM2.5 level is above 200, Low otherwise).
Should I treat it as a regression problem with the numeric PM2.5 level as the target, or as a classification problem where I derive a binary high/low variable from the threshold and use that as the target? What are the advantages and disadvantages of each, and what impact can the choice have on accuracy, if any?
Upvotes: 1
Views: 457
Reputation: 16966
I would suggest going with a classification model if your client is not interested in knowing the actual values.
Convert the numeric target into a binary high/low label using the agreed threshold, then train a classifier on that label.
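As a minimal sketch of that conversion step, assuming the readings are available as a plain list of PM2.5 values (the names `to_label`, `pm25_readings` and the sample numbers are made up for illustration):

```python
# Threshold agreed with the client: High if PM2.5 is above 200, Low otherwise.
PM25_THRESHOLD = 200.0


def to_label(pm25, threshold=PM25_THRESHOLD):
    """Map a numeric PM2.5 reading to the binary class the client cares about."""
    # Strictly greater than the threshold counts as High, so exactly 200 is Low.
    return "High" if pm25 > threshold else "Low"


# Hypothetical readings, only to show the mapping.
pm25_readings = [35.0, 180.5, 200.0, 200.1, 412.0]
labels = [to_label(value) for value in pm25_readings]
print(labels)
```

The resulting `labels` list would then be used as the target for whichever classifier you choose, in place of the raw PM2.5 column.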
Classification has a good chance of giving better accuracy here, because the model concentrates on the decision boundary, whereas a regression model may be pulled towards fitting outliers and noisy data points that have no effect on the high/low decision.
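To illustrate why a low regression error does not guarantee good high/low decisions, here is a small sketch with invented numbers. `pred_a` and `pred_b` stand in for the outputs of two hypothetical regression models; they are not results from any real model.

```python
THRESHOLD = 200.0


def mse(actual, predicted):
    """Mean squared error, the usual regression loss."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def threshold_accuracy(actual, predicted, threshold=THRESHOLD):
    """Share of readings placed on the correct side of the threshold."""
    hits = sum((a > threshold) == (p > threshold) for a, p in zip(actual, predicted))
    return hits / len(actual)


# Invented PM2.5 values: two near the threshold and one extreme reading.
actual = [50.0, 190.0, 205.0, 600.0]

# Hypothetical model A: badly misses the extreme value but is on the right side everywhere.
pred_a = [60.0, 195.0, 210.0, 300.0]

# Hypothetical model B: numerically close everywhere, but wrong side for the two borderline readings.
pred_b = [50.0, 202.0, 198.0, 600.0]

print("MSE A:", mse(actual, pred_a), "accuracy A:", threshold_accuracy(actual, pred_a))
print("MSE B:", mse(actual, pred_b), "accuracy B:", threshold_accuracy(actual, pred_b))
```

In this toy data model B has the much smaller squared error, yet it gets half of the high/low decisions wrong, while model A gets all of them right. If you do try regression, comparing it to a classifier on the thresholded labels (as `threshold_accuracy` does) keeps the comparison on the criterion the client actually cares about.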
Upvotes: 1