Confusion about sklearn cross_val_predict Method

Question

Consider this code snippet:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split

# Load the data and keep only the numeric columns
df = pd.read_csv('module_5_auto.csv')
df = df._get_numeric_data()

y_data = df['price']
x_data = df.drop('price', axis=1)

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)

lre = LinearRegression()
lre.fit(x_train[['horsepower']], y_train)

# 4-fold cross-validation on the full data set
cross_val_score(lre, x_data[['horsepower']], y_data, cv=4)
cross_val_predict(lre, x_data[['horsepower']], y_data, cv=4)

As I understand it, cross_val_score splits the data into cv folds (4 here). Each fold is used once as the test set while the remaining 3 folds form the training set. The model is trained on the training set, scored on the test fold, and then discarded. The output is the 4 scores, one for each of the 4 test folds.
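
The fold-by-fold procedure described above can be sketched by hand and compared with cross_val_score. This is only a minimal sketch under stated assumptions: X and y are synthetic stand-ins for the horsepower and price columns (not the real CSV), and it relies on the documented default that an integer cv with a regressor means an unshuffled KFold, with the estimator's own score method (R^2 for LinearRegression) used for scoring.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data standing in for horsepower (X) and price (y).
rng = np.random.default_rng(0)
X = rng.uniform(50, 250, size=(40, 1))
y = 150.0 * X[:, 0] + rng.normal(0, 2000, size=40)

lre = LinearRegression()

manual_scores = []
for train_idx, test_idx in KFold(n_splits=4).split(X):
    model = clone(lre)                     # fresh, unfitted copy for this fold
    model.fit(X[train_idx], y[train_idx])  # train on the other 3 folds
    # Score (R^2) on the held-out fold, then the model is thrown away.
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

scores = cross_val_score(lre, X, y, cv=4)  # one score per test fold
```

Under these assumptions, manual_scores and scores should hold the same 4 values. Note that the earlier lre.fit(...) call plays no role here, since a fresh copy of the estimator is fitted in every fold.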

But what about cross_val_predict? What exactly is its output? Is it the prediction of the model with the highest score among the 4 models, or is it the mean of the predictions of the 4 models?

I found someone mentioning that

The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set.

What does "each element in the input" mean? There are 4 folds, hence 4 training sets and 4 test sets. Which one of them is the "element"?
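
One way to probe this is to reproduce the call by hand and compare. The sketch below reads "element" as a single row (sample) of the input: every row lands in exactly one test fold, and its prediction comes from the model that was trained without that fold. The data is synthetic and the names X, y and manual_pred are hypothetical. It also assumes the documented default that an integer cv with a regressor means an unshuffled KFold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Hypothetical synthetic data standing in for horsepower (X) and price (y).
rng = np.random.default_rng(0)
X = rng.uniform(50, 250, size=(40, 1))
y = 150.0 * X[:, 0] + rng.normal(0, 2000, size=40)

lre = LinearRegression()

manual_pred = np.empty(len(y))
for train_idx, test_idx in KFold(n_splits=4).split(X):
    # Fit a fresh copy on the 3 training folds.
    model = clone(lre).fit(X[train_idx], y[train_idx])
    # Fill in predictions only for the rows of the held-out fold.
    manual_pred[test_idx] = model.predict(X[test_idx])

pred = cross_val_predict(lre, X, y, cv=4)  # one prediction per input row
```

If this reading is right, pred has one entry per row of X and matches manual_pred. It is then neither the output of the best-scoring model nor an average over the 4 models: each of the 4 models contributes the predictions for its own test fold only.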
