Waleed

Reputation: 41

Confusion about sklearn cross_val_predict Method

Consider this code snippet:

import pandas as pd

df = pd.read_csv('module_5_auto.csv')
df = df._get_numeric_data()


y_data = df['price']
x_data = df.drop('price',axis=1)


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)

from sklearn.linear_model import LinearRegression
lre=LinearRegression()
lre.fit(x_train[['horsepower']], y_train)



from sklearn.model_selection import cross_val_score
cross_val_score(lre, x_data[['horsepower']], y_data, cv=4)


from sklearn.model_selection import cross_val_predict
cross_val_predict(lre, x_data[['horsepower']], y_data, cv=4)

I understand that cross_val_score divides the data into folds (as many as cv specifies), uses each fold in turn as the test data and the remaining 3 folds as the training data, trains the model, scores it on the test fold, and then discards the model. It outputs 4 scores, one for each of the 4 distinct test folds.

But what about cross_val_predict? What exactly is its output? Is it the prediction of the highest-scoring of the 4 models, or is it the mean of the predictions of the 4 models?

I found someone mentioning that:

The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set.

What does "each element in the input" mean? There are 4 folds, 4 training sets and 4 test sets. Which of them is the "element"?

Upvotes: 2

Views: 1316

Answers (1)

afsharov

Reputation: 5174

The quote you provided is not from just "someone"; it comes from the official scikit-learn user guide section on obtaining predictions by cross-validation. You also left out a sentence that might help clear up your confusion:

The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).
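To illustrate the last sentence of that quote, here is a minimal sketch using synthetic data (an assumption, since the CSV from the question is not available here). ShuffleSplit draws each test set independently at random, so some samples are tested several times and others possibly never; cross_val_predict should reject such a strategy with a ValueError.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_predict

# Synthetic stand-in data (assumption): 20 samples, one feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=20)

# Each of the 4 splits picks half of the samples at random as its test set,
# so the test sets overlap and do not form a partition of the data.
cv = ShuffleSplit(n_splits=4, test_size=0.5, random_state=0)

try:
    cross_val_predict(LinearRegression(), X, y, cv=cv)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```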

Before cross-validation starts, you have a single input (a.k.a. the dataset) that contains all samples (or, as the user guide calls them, elements). During cross-validation, these samples are split into k folds. In your case, k is set to 4, so each sample ends up in one, and only one, of these folds.
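You can check this fold assignment yourself. The sketch below assumes a hypothetical dataset of 12 samples and uses an unshuffled KFold, which is what an integer cv should resolve to for a regressor such as LinearRegression. It counts how often each sample index lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 12  # hypothetical dataset size, chosen only for illustration
X = np.arange(n_samples).reshape(-1, 1)

# Assumption: cv=4 with a regressor behaves like an unshuffled 4-fold KFold.
kf = KFold(n_splits=4)

# Count how many times each sample index appears in a test fold.
times_in_test = np.zeros(n_samples, dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: test indices {test_idx}")
    times_in_test[test_idx] += 1

print(times_in_test)
```

Every entry of times_in_test should be 1: each sample is in a test fold exactly once.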

The crucial part is that each sample (or element) is predicted exactly once during cross-validation: when its fold is used as the test set, by the model trained on the other folds. A sample is never predicted 4 times by four different models, as you might have thought.


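In conclusion: there is only one prediction for each sample (or element) in the input during cross-validation, and this one prediction is what cross_val_predict returns. The output is therefore neither the prediction of the best-scoring model nor a mean over the 4 models.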
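As a rough check of this, the sketch below rebuilds the output by hand on synthetic data (an assumption; it does not use the CSV from the question). For each fold, a fresh copy of the model is trained on the other three folds, and its predictions are written into the positions of the test samples. The manual loop assumes that cv=4 corresponds to an unshuffled KFold for a regressor.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data (assumption): 40 samples, one feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.2, size=40)

lre = LinearRegression()
library_pred = cross_val_predict(lre, X, y, cv=4)

# Manual version: one prediction per sample, made by the model
# that did NOT see that sample during training.
manual_pred = np.empty(len(y))
for train_idx, test_idx in KFold(n_splits=4).split(X):
    model = clone(lre).fit(X[train_idx], y[train_idx])
    manual_pred[test_idx] = model.predict(X[test_idx])

print(np.allclose(library_pred, manual_pred))
```

If the two arrays agree, the returned array is simply the four per-fold test predictions stitched back together in the original sample order, with the same length as y.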

Upvotes: 4
