random student

Reputation: 775

How to achieve a regression model without underfitting or overfitting

I have a university project where I was given a dataset in which almost all of the features have a very weak correlation with the target (only one feature has a moderate correlation), and the target's distribution is not normal either. I already tried a simple linear regression model, which underfit; then I tried a plain random forest regressor, which overfit; and when I tuned the random forest with RandomizedSearchCV it took far too long. Is there any way to get a decent model out of a not-so-good dataset without underfitting or overfitting, or is it just not possible at all?

Upvotes: 0

Views: 587

Answers (1)

gust

Reputation: 945

Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.

Some suggestions, though:

Overfitting on random forests

  • Personally, I'd go down this route, since you mention that your data is not strongly correlated with the target. It's typically easier to fix overfitting than underfitting, so that helps too.

  • Try looking at your tree outputs. If you are using Python, scikit-learn's export_graphviz can be helpful (a short export_graphviz sketch follows this list).

  • Try reducing the maximum depth of the trees.

  • Try increasing the minimum number of samples a node must have in order to split (min_samples_split), or similarly, the minimum number of samples a leaf must have (min_samples_leaf); the constrained-forest sketch after this list shows both settings.

  • Try increasing the number of trees in the RF.
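
Here is a minimal sketch of the tree-inspection idea, assuming scikit-learn is installed; the data comes from make_regression as a stand-in, so swap in your own features and target:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.tree import export_graphviz

    # Stand-in data; replace with your own features and target.
    X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X, y)

    # Dump the first tree of the forest to a .dot file; render it with
    # Graphviz, e.g. `dot -Tpng tree.dot -o tree.png`.
    export_graphviz(rf.estimators_[0], out_file="tree.dot", filled=True, rounded=True)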
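
And a sketch of reining the forest in with the settings above (again on stand-in data; the exact numbers are only assumed starting points to tune, not recommendations):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Stand-in data; replace with your own features and target.
    X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Constrain the trees so they cannot memorise the training set:
    #   max_depth         - shallower trees
    #   min_samples_split - a node needs at least this many samples to split
    #   min_samples_leaf  - every leaf keeps at least this many samples
    #   n_estimators      - more trees average out individual-tree noise
    rf = RandomForestRegressor(
        n_estimators=500,
        max_depth=5,
        min_samples_split=10,
        min_samples_leaf=5,
        random_state=0,
    )
    rf.fit(X_train, y_train)

    # A large gap between train and test RMSE still indicates overfitting.
    print("train RMSE:", mean_squared_error(y_train, rf.predict(X_train)) ** 0.5)
    print("test RMSE: ", mean_squared_error(y_test, rf.predict(X_test)) ** 0.5)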

Underfitting on linear regression

  • Add more features, and hence parameters. If you have variables a, b, etc., adding their polynomial features, i.e. a^2, a^3, ..., b^2, b^3, ..., may help. If you add enough polynomial features you should be able to overfit, although a good fit on the train set (a low RMSE value) doesn't necessarily mean the model will do well on the test set (a PolynomialFeatures sketch follows this list).

  • Try plotting some of the variables against the value to predict (y). You may be able to spot a non-linear pattern (e.g. a logarithmic relationship); a plotting sketch follows this list.

  • Do you know anything about the data? Perhaps a derived feature that is the product of, or the ratio between, two variables would be a good predictor (a toy product/ratio example follows this list).

  • If you are regularizing your regression (or if the software applies regularization automatically), try reducing the regularization parameter (a Ridge sketch follows this list).
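
A minimal PolynomialFeatures sketch for the first point, assuming scikit-learn; the data and degree=2 are just stand-ins:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Stand-in data; replace with your own features and target.
    X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # degree=2 adds squares and pairwise products of every feature; raise the
    # degree gradually and watch the test RMSE, since high degrees overfit fast.
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X_train, y_train)

    print("train RMSE:", mean_squared_error(y_train, model.predict(X_train)) ** 0.5)
    print("test RMSE: ", mean_squared_error(y_test, model.predict(X_test)) ** 0.5)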
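
A plotting sketch for the second point, assuming matplotlib and pandas; the DataFrame here is fabricated just to show the loop, so point it at your own columns and target:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # Fabricated data with a deliberately logarithmic relationship;
    # replace `df` and "target" with your own DataFrame and target column.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"a": rng.uniform(1, 100, 300), "b": rng.normal(size=300)})
    df["target"] = np.log(df["a"]) + rng.normal(scale=0.2, size=300)

    # One scatter plot per feature; look for curves, thresholds, log shapes.
    for col in df.columns.drop("target"):
        plt.figure()
        plt.scatter(df[col], df["target"], s=10)
        plt.xlabel(col)
        plt.ylabel("target")
        plt.title(f"{col} vs. target")
    plt.show()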
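
For the domain-knowledge point, a toy product/ratio example (the column names are made up):

    import pandas as pd

    # Made-up columns; the idea is that the product or ratio of two raw
    # variables can be a stronger single predictor than either one alone.
    df = pd.DataFrame({"distance": [10.0, 25.0, 40.0], "time": [1.0, 2.0, 5.0]})
    df["distance_times_time"] = df["distance"] * df["time"]  # product feature
    df["speed"] = df["distance"] / df["time"]                # ratio feature
    print(df)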
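
And for the regularization point, a Ridge sketch that sweeps alpha downwards, assuming scikit-learn (note that plain LinearRegression in scikit-learn is not regularized, so this only applies if you use Ridge/Lasso or a library that regularizes by default):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Stand-in data; replace with your own features and target.
    X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

    # alpha is the regularization strength; smaller alpha means less shrinkage.
    # Keep the value with the best cross-validated score.
    for alpha in (10.0, 1.0, 0.1, 0.01):
        scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
        print(f"alpha={alpha:<5} mean CV R^2 = {scores.mean():.3f}")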

Upvotes: 3
