Reputation: 400
I expected both methods to return rather similar errors; can someone point me to the mistake, please?
Calculating RMSE...
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score

rf = RandomForestRegressor(random_state=555, n_estimators=100, max_depth=8)
# Method 1: RMSE from the cross-validated predictions
rf_preds = cross_val_predict(rf, train_, targets, cv=7, n_jobs=7)
print("RMSE Score using cv preds: {:0.5f}".format(metrics.mean_squared_error(targets, rf_preds, squared=False)))
# Method 2: RMSE via cross_val_score
scores = cross_val_score(rf, train_, targets, cv=7, scoring='neg_root_mean_squared_error', n_jobs=7)
print("RMSE Score using cv_score: {:0.5f}".format(scores.mean() * -1))
RMSE Score using cv preds: 0.01658
RMSE Score using cv_score: 0.01073
Upvotes: 1
Views: 1236
Reputation: 434
Adding to the good work of @desertnaut...
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = make_regression(n_samples=100, n_features=4, n_informative=2,
                       random_state=42, shuffle=False)
rf = RandomForestRegressor(max_depth=2, random_state=0)
kf = KFold(n_splits=3)

rf_preds = cross_val_predict(rf, X, y, cv=kf, n_jobs=5)
print("RMSE Score using cv preds:\t", "{:0.5f}".format(mean_squared_error(y, rf_preds, squared=False)))

scores = cross_val_score(rf, X, y, cv=kf, scoring='neg_root_mean_squared_error', n_jobs=5)
print("RMSE Score using cv_score:\t", "{:0.5f}".format(scores.mean() * -1))

# Average the per-fold RMSEs computed from the pooled predictions
fold_rmses = [mean_squared_error(y[val_idx], rf_preds[val_idx], squared=False)
              for _, val_idx in kf.split(X, y)]
print("RMSE Score using cv_rf_preds:\t", "{:0.5f}".format(np.mean(fold_rmses)))
We can see that cross_val_score gives the same result as the loop-derived, per-fold score:
RMSE Score using cv preds: 23.00883
RMSE Score using cv_score: 21.88026
RMSE Score using cv_rf_preds: 21.88026
For a deeper dive look here: https://stackoverflow.com/a/77847608/4241746
Except for very specific cases, cross_val_predict is not really meant for scoring. It can be seen more as an intermediate step: building models and making predictions with them. The next step is scoring the models by comparing these predictions with the true values.
Upvotes: 0
Reputation: 60321
There are two issues here, both of which are mentioned in the documentation of cross_val_predict:

Results can differ from cross_validate and cross_val_score unless all test sets have equal size and the metric decomposes over samples.
The first requirement is to make all sets (training and test) identical in both cases, which is not the case in your example. To do so, we need to employ KFold to define our CV folds explicitly, and then use these same folds in both cases. Here is an example with dummy data:
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
X, y = make_regression(n_samples=2000, n_features=4, n_informative=2,
                       random_state=42, shuffle=False)
rf = RandomForestRegressor(max_depth=2, random_state=0)
kf = KFold(n_splits=5)
rf_preds = cross_val_predict(rf, X, y, cv=kf, n_jobs=5)
print("RMSE Score using cv preds: {:0.5f}".format(mean_squared_error(y, rf_preds, squared=False)))
scores = cross_val_score(rf, X, y, cv=kf, scoring='neg_root_mean_squared_error', n_jobs=5)
print("RMSE Score using cv_score: {:0.5f}".format(scores.mean() * -1))
The result of the above code snippet (fully reproducible, since we have explicitly set all the necessary random seeds) is:
RMSE Score using cv preds: 15.16839
RMSE Score using cv_score: 15.16031
So, we can see that the two scores are indeed similar, but still not identical.
Why is that? The answer lies in the rather cryptic second part of the quoted sentence above, i.e. that the RMSE score does not decompose over samples (to be honest, I don't know of any ML score that does).
In simple words, while cross_val_predict computes the RMSE strictly according to its definition, i.e. (pseudocode):

RMSE = square_root([(y[1] - y_pred[1])^2 + (y[2] - y_pred[2])^2 + ... + (y[n] - y_pred[n])^2]/n)

where n is the number of samples, the cross_val_score method does not do exactly that; instead, it computes the RMSE for each one of the k CV folds and then averages these k values, i.e. (pseudocode again):
RMSE = (RMSE[1] + RMSE[2] + ... + RMSE[k])/k
And exactly because the RMSE is not decomposable over the samples, these two values, although close, are not identical.
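To make this concrete, here is a toy illustration (the squared-error values are made up purely for demonstration and are unrelated to the model above): even with equal-sized folds, the square root of the pooled mean is not the mean of the per-fold square roots.

import numpy as np

# Two folds of per-sample squared errors (made-up numbers for illustration)
sq_err_fold1 = np.array([1.0, 4.0])
sq_err_fold2 = np.array([9.0, 16.0])

# Pooled RMSE, as computed from the concatenated cross_val_predict predictions
pooled = np.sqrt(np.concatenate([sq_err_fold1, sq_err_fold2]).mean())

# Mean of per-fold RMSEs, as computed by cross_val_score
per_fold = np.mean([np.sqrt(sq_err_fold1.mean()), np.sqrt(sq_err_fold2.mean())])

print("{:0.5f}".format(pooled))    # sqrt(7.5) = 2.73861
print("{:0.5f}".format(per_fold))  # (sqrt(2.5) + sqrt(12.5)) / 2 = 2.55834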
We can actually demonstrate that this is indeed the case by performing the CV procedure manually and emulating the RMSE calculation as done by cross_val_score and described above, i.e.:
import numpy as np

RMSE__cv_score = []
for train_index, val_index in kf.split(X):
    # Fit on the training fold and score on the held-out fold, as cross_val_score does
    rf.fit(X[train_index], y[train_index])
    pred = rf.predict(X[val_index])
    err = mean_squared_error(y[val_index], pred, squared=False)
    RMSE__cv_score.append(err)

print("RMSE Score using manual cv_score: {:0.5f}".format(np.mean(RMSE__cv_score)))
The result being:
RMSE Score using manual cv_score: 15.16031
i.e. identical to the one returned by cross_val_score above.
So, if we want to be very precise, the truth is that the correct RMSE (i.e. the one calculated exactly according to its definition) is the one returned by cross_val_predict; cross_val_score returns an approximation of it. But in practice, we often find that the difference is not that significant, so we can also use cross_val_score if it is more convenient.
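As a side note (a sketch of my own, not part of the answer above): unlike the RMSE, the plain MSE does decompose over samples, so the pooled RMSE returned by cross_val_predict can be recovered from per-fold MSEs by weighting each fold's MSE by its size before taking the square root. Reusing kf, X, y and rf from the snippets above:

import numpy as np
from sklearn.metrics import mean_squared_error

fold_mses, fold_sizes = [], []
for train_index, val_index in kf.split(X):
    rf.fit(X[train_index], y[train_index])
    pred = rf.predict(X[val_index])
    fold_mses.append(mean_squared_error(y[val_index], pred))  # plain MSE; sqrt deferred
    fold_sizes.append(len(val_index))

pooled_rmse = np.sqrt(np.average(fold_mses, weights=fold_sizes))
print("Pooled RMSE from per-fold MSEs: {:0.5f}".format(pooled_rmse))

Since the folds partition the data, this should reproduce the cross_val_predict figure (15.16839) rather than the fold-averaged one.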
Upvotes: 3