Mario Mandušić

Reputation: 1

XGBoost regression RMSE individual prediction

I have a simple regression problem with two independent variables and one dependent variable. I tried linear regression from statsmodels and scikit-learn, but I get the best results (R^2 and RMSE) with the XGBoost regressor.

On a new data set, the RMSE is still in line with earlier results, but the accuracy of individual predictions varies widely.

For example, the RMSE is 1000, while the errors of individual predictions range from 20 to 3000. So predictions are either almost perfectly accurate or, in a few cases, deviate strongly, and I don't know why.

My question is: what causes such variation in individual predictions?

Upvotes: -1

Views: 1638

Answers (1)

andrish

Reputation: 26

When testing your model with new data, it's normal to get some of the predictions wrong. RMSE is the square root of the mean of the squared differences between the actual and predicted values, so an RMSE of 1000 means the typical error is on the order of 1000, not that every prediction is off by 1000. Because the errors are squared before averaging, a few large errors dominate the result: you can have values that are predicted very well alongside values that give a very large squared error.

The reason for this could be overfitting. It could also be that the new data set contains data that is very different from the data the model was trained on. But since the RMSE is in line with earlier results, I understand that RMSE was around 1000 on the training set as well. Therefore I don't necessarily see a problem with the test set. What I would do is go through the preprocessing steps for the training data and make sure they're done correctly:

  • standardize the data and remove possible skewness
  • check for collinearity between independent variables (you only have 2, so it should be easy to do)
  • check whether the independent variables have an acceptable variance. If a variable barely changes from one data point to the next, it may be useless for explaining the dependent variable.
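The three checks above can be sketched with NumPy. This is only an illustration: the data is synthetic, and the names `x1`, `x2`, and `skewness` are made up for the example, not taken from the question.

```python
import numpy as np

# Hypothetical data: two independent variables (not from the question).
rng = np.random.default_rng(0)
x1 = rng.exponential(scale=2.0, size=500)        # right-skewed on purpose
x2 = 0.9 * x1 + rng.normal(0.0, 0.5, size=500)   # strongly correlated with x1
X = np.column_stack([x1, x2])

def skewness(col):
    """Sample skewness: mean of cubed z-scores (about 0 for symmetric data)."""
    z = (col - col.mean()) / col.std()
    return float(np.mean(z ** 3))

# 1. Skewness of each feature.
skews = [skewness(X[:, j]) for j in range(X.shape[1])]

# 2. Collinearity: Pearson correlation between the two features.
corr = float(np.corrcoef(X[:, 0], X[:, 1])[0, 1])

# 3. Variance of each feature (near-zero variance means little information).
variances = X.var(axis=0)

print("skewness per feature:", skews)
print("correlation x1 vs x2:", corr)
print("variance per feature:", variances)
```

A skewness far from 0, a correlation close to 1 or -1, or a variance close to 0 would each be a reason to look more closely at that variable.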

BTW, what is the R2 score for your regression? It tells you how much of the variability of the target variable is explained by your model. A low R2 score would indicate that the regressors used aren't very useful in explaining your target variable.
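To see how an RMSE near 1000 can coexist with mostly tiny errors, here is a small sketch that computes RMSE and R2 by hand. The numbers are hypothetical, picked only to resemble the situation in the question (errors of 20 on most points and 3000 on one).

```python
import math

# Hypothetical values, chosen only to illustrate the effect.
y_true = [5000, 7000, 9000, 11000, 13000, 15000, 17000, 19000, 21000, 23000]
errors = [20, -20, 20, -20, 20, -20, 20, -20, 20, 3000]
y_pred = [t + e for t, e in zip(y_true, errors)]

residuals = [t - p for t, p in zip(y_true, y_pred)]
abs_errors = [abs(r) for r in residuals]

# RMSE: square root of the mean squared error.
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# R2: 1 - (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / len(y_true)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("smallest / largest absolute error:", min(abs_errors), max(abs_errors))
print("RMSE:", rmse)
print("R2:", r2)
```

Here nine of the ten errors are 20, yet the single error of 3000 pushes the RMSE to roughly 950. Looking at the distribution of absolute errors, not just the RMSE, shows whether a few outliers are responsible.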

You can use the scikit-learn class StandardScaler to standardize the data.
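A minimal sketch of that step, assuming two numeric features. The arrays `X_train` and `X_new` are synthetic placeholders, not the asker's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and new data with two features on different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50.0, 2000.0], scale=[5.0, 300.0], size=(200, 2))
X_new = rng.normal(loc=[50.0, 2000.0], scale=[5.0, 300.0], size=(50, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_new_scaled = scaler.transform(X_new)          # reuse the same mean/std on new data

print("train means after scaling:", X_train_scaled.mean(axis=0))
print("train stds after scaling:", X_train_scaled.std(axis=0))
```

Fitting the scaler on the training data and only calling `transform` on new data keeps both sets on the same scale. Note that tree-based models such as XGBoost split on thresholds, so scaling usually matters more for the linear models you tried than for the XGBoost regressor.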

Upvotes: 1
