Luca Fichera

Reputation: 43

How do cross_val_score and GridSearchCV work?

I am new to Python and have been trying to figure out how GridSearchCV and cross_val_score work.

After getting odd results, I set up a sort of validation experiment, but I still do not understand what I am doing wrong.

To simplify, I am using GridSearchCV in the simplest possible way to validate and understand what is happening:

Here it is:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression, RFECV
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV,Ridge, LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV,KFold,TimeSeriesSplit,PredefinedSplit,cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer, r2_score, mean_absolute_error
from math import sqrt

I create a cross-validation object (for GridSearchCV and cross_val_score) and a manual train/test split for the pipeline and the simple linear regression. I have checked that the two datasets are identical:

train_indices = np.full((15,), -1, dtype=int)
test_indices = np.full((6,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)
kf = PredefinedSplit(test_fold)

for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train_kf = X[train_index]
    X_test_kf = X[test_index]

train_data = list(range(0,15))
test_data = list(range(15,21))

X_train, y_train=X[train_data,:],y[train_data]
X_test, y_test=X[test_data,:],y[test_data]

Here is what I do:

First, I instantiate a simple linear model and score it on the manual train/test split:

lr=LinearRegression()
lm=lr.fit(X,y)
lmscore_train=lm.score(X_train,y_train) 

->r2=0.4686662249071524

lmscore_test=lm.score(X_test,y_test)

->r2 0.6264021467338086

Now I try to do exactly the same thing using a pipeline:

pipe_steps = ([('est', LinearRegression())])
pipe=Pipeline(pipe_steps)
p=pipe.fit(X,y)
pscore_train=p.score(X_train,y_train) 

->r2=0.4686662249071524

pscore_test=p.score(X_test,y_test)

->r2 0.6264021467338086

LinearRegression and the pipeline match perfectly.

Now I try to do the same with cross_val_score, using the predefined split kf:

cv_scores = cross_val_score(lm, X, y, cv=kf)  

->r2 = -1.234474757883921470e+01?!?! (this is supposed to be the test score)

Now let's try GridSearchCV:

scoring = {'r_squared':'r2'}
grid_parameters = [{}] 
gridsearch=GridSearchCV(p, grid_parameters, verbose=3,cv=kf,scoring=scoring,return_train_score=True,refit='r_squared')
gs=gridsearch.fit(X,y)
results=gs.cv_results_

From cv_results_ I once again get mean_test_r_squared -> r2 = -1.234474757883921292e+01

So cross_val_score and GridSearchCV match one another in the end, but the score is totally off and different from what it should be.

Could you please help me solve this puzzle?

Upvotes: 2

Views: 3549

Answers (1)

Vivek Kumar

Reputation: 36599

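cross_val_score and GridSearchCV first split the data, then train a fresh copy of the model on the train portion only, and finally score it on the test portion.

In your manual code you are training on the full data (X, y) and then scoring on the test data, so the model has already seen the test rows. That is why your manual scores don't match the results of cross_val_score and GridSearchCV.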
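As a rough illustration of that split, fit, score sequence, here is a minimal sketch that reproduces cross_val_score by hand. The X and y from the question are not shown, so the data below is a synthetic stand-in (21 rows, 3 features, both assumptions), and the scores will not match the numbers in the question.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import PredefinedSplit, cross_val_score

# Synthetic stand-in for the question's data (assumption: 21 rows, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=21)

# Same split as in the question: rows 0-14 are always train, rows 15-20 are test.
test_fold = np.append(np.full(15, -1, dtype=int), np.full(6, 0, dtype=int))
kf = PredefinedSplit(test_fold)

estimator = LinearRegression()
cv_scores = cross_val_score(estimator, X, y, cv=kf)

# Manual version: for each split, fit an unfitted copy on the train rows only,
# then score (R^2 for regressors) on the held-out rows.
manual_scores = []
for train_index, test_index in kf.split(X):
    model = clone(estimator)
    model.fit(X[train_index], y[train_index])
    manual_scores.append(model.score(X[test_index], y[test_index]))

print(cv_scores, manual_scores)
```

The two printed values should agree, because the held-out rows are never used for fitting in either version.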

Instead of this:

lm=lr.fit(X,y)

Try this:

lm=lr.fit(X_train, y_train)

Same for pipeline:

Instead of p=pipe.fit(X,y), do this:

p=pipe.fit(X_train, y_train)
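To check the same thing for GridSearchCV, the sketch below compares a pipeline fitted on the train rows only with the test score reported in cv_results_. It again uses synthetic data in place of the question's X and y (an assumption), so only the agreement between the two numbers matters, not their values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the question's data (assumption: 21 rows, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=21)

X_train, y_train = X[:15], y[:15]
X_test, y_test = X[15:], y[15:]

pipe = Pipeline([('est', LinearRegression())])

# Fit on the train rows only, then score on the held-out rows.
train_only_score = pipe.fit(X_train, y_train).score(X_test, y_test)

# Fit on everything (as in the question), then score on rows the model has seen.
full_fit_score = pipe.fit(X, y).score(X_test, y_test)

# Same predefined split and scoring setup as in the question.
kf = PredefinedSplit(np.append(np.full(15, -1), np.full(6, 0)))
gridsearch = GridSearchCV(
    Pipeline([('est', LinearRegression())]),
    [{}],
    cv=kf,
    scoring={'r_squared': 'r2'},
    return_train_score=True,
    refit='r_squared',
)
gridsearch.fit(X, y)
grid_test_score = gridsearch.cv_results_['mean_test_r_squared'][0]

print(train_only_score, grid_test_score, full_fit_score)
```

The first two numbers should match. Note that with refit enabled, GridSearchCV refits the best estimator on all of X and y after the search, so calling gridsearch.score(X_test, y_test) afterwards would again score on rows the model has seen.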

Upvotes: 1
