Reputation: 775
I have my university project and i'm given a dataset
which almost all features have a very weak (only 1 feature has moderate correlation with the target) correlation with the target. It's distribution is not normal too. I already tried to apply simple model linear regression it caused underfitting
, then i applied simple random forest regressor
but it caused overfitting
but when i applied random forest regressor
with optimization with randomsearchcv
it took time so long. Is there any way to get decent model with not-so-good dataset
without underfitting or overfitting? or it's just not possible at all?
Upvotes: 0
Views: 587
Reputation: 945
Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Personally, I'd try to hack this route since you mention that your data is not strongly correlated. It's typically easier to fix overfitting than underfitting so that helps, too.
Try looking at your tree outputs. If you are using python
, sci-kit learn
's export_graphviz
can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the maximum number of a samples a tree must have in order to split (or similarly, the minimum number of samples a leaf should have).
Try increasing the number of trees in the RF.
Add more parameters. If you have variables a, b, ... etc. adding their polynomial features, i.e. a^2, a^3 ... b^2, b^3 ... etc. may help. If you add enough polynomial features you should be able to overfit -- although that doesn't necessarily mean it will have a good fit on the train set (RMSE value).
Try plotting some of the variables against the value to predict (y). Perhaps you may be able to see a non-linear pattern (i.e. a logarithmic relationship).
Do you know anything about the data? Perhaps a variable that is the multiple, or the division between two variables may be a good indicator.
If you are regularizing (or if the software is automatically applying) your regression, try reducing the regularization parameter.
Upvotes: 3