Reputation: 569
I want to perform residual analysis, and I know that residuals equal the observed values minus the predicted ones. But I don't know whether I should calculate the residuals from the training set or the test set.
Should I use this:
import statsmodels.api as sm
# Making predictions
lm = sm.OLS(y_train,X_train).fit()
y_pred = lm.predict(X_train)
resid = y_train - y_pred.to_frame('price')
OR this:
import statsmodels.api as sm
# Making predictions
lm = sm.OLS(y_train,X_train).fit()
y_pred = lm.predict(X_test)
resid = y_test- y_pred.to_frame('price')
Upvotes: 1
Views: 2190
Reputation: 1
The accepted answer is quite misleading in my opinion. When you want to perform a residual analysis to check the validity of the model's assumptions, it should be performed on the training data, since that is the data your model is fitted to.
Whether you run the residual analysis on the test set or the training set often does not matter much, since they usually come from the same population. But most of the time the residuals of a model on the test data will not have exactly zero mean. This can be due to differences in distribution, trends, etc. So the first step is to check whether the model fits the training data well (by checking the model's assumptions against the training data). The second step is to check whether the model generalizes well to unseen data (by evaluating it on the test data).
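To make the two steps concrete, here is a minimal sketch on synthetic data. It is not the asker's statsmodels code: the data, the single feature, and the helper names `fit_simple_ols` and `residuals` are all made up for illustration, and the fit is a plain closed-form least squares with an intercept. With an intercept in the model, the training residuals average to zero by construction, while the test residuals generally do not. Note that `sm.OLS` does not add an intercept on its own, so this property only carries over if `X_train` already contains a constant column.

```python
import random

def fit_simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Observed minus predicted, one value per observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Synthetic data (assumption): a linear signal plus Gaussian noise.
rng = random.Random(0)
x_all = [rng.uniform(0, 10) for _ in range(200)]
y_all = [3.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x_all]

# Simple split: first 150 rows for training, last 50 for testing.
x_train, y_train = x_all[:150], y_all[:150]
x_test, y_test = x_all[150:], y_all[150:]

# Fit on the training set only.
intercept, slope = fit_simple_ols(x_train, y_train)

# Step 1: training residuals, used to check the model's assumptions.
resid_train = residuals(x_train, y_train, intercept, slope)
# Step 2: test residuals, used to check generalization.
resid_test = residuals(x_test, y_test, intercept, slope)

mean_resid_train = sum(resid_train) / len(resid_train)
mean_resid_test = sum(resid_test) / len(resid_test)
print("mean training residual:", mean_resid_train)
print("mean test residual:", mean_resid_test)
```

In practice you would plot `resid_train` against the fitted values (or draw a Q-Q plot) to check the assumptions, and summarize `resid_test` with an error metric such as RMSE to judge generalization.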
Upvotes: 0
Reputation: 39052
The residual error should be computed from the actual values (expected outcome) of the test set, y_test, and the values predicted by the fitted model for X_test. The model is fitted to the training set and then its accuracy is tested on the test set. This is how I see it intuitively: it is the main reason the two datasets are formally called train (for training) and test (for testing).
Specifically, use the second case:
resid = y_test- y_pred.to_frame('price')
Upvotes: 3