Why is r2_score quite different between train_test_split and Pipeline cross_val_score?

Question

Why does r2_score differ so much between a manual train_test_split evaluation and cross_val_score on the same Pipeline? I suspected that the model could see otherwise unknown words through CountVectorizer() in the pipeline. But as I understand the concept of Pipeline, CountVectorizer() should be fitted only on the training portion of each split made by cross_val_score, shouldn't it?

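from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

pipe = Pipeline([('Vect', CountVectorizer()), ('rf', RandomForestRegressor(random_state=1))])

# Unshuffled 50/50 split: the first half trains, the second half tests
X_train, X_test, y_train, y_test = train_test_split(df['X'], df['price'], shuffle=False, test_size=0.5)

reg = pipe.fit(X_train, y_train)
mypred = reg.predict(X_test)
r2_score(y_test, mypred)  # signature is r2_score(y_true, y_pred)
# the original call, r2_score(mypred, y_test) with the arguments swapped, gave -0.2
cross_val_score(pipe, df['X'], df['price'], cv=2)
# result is about 0.3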


Answers (1)
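
One way to check the comparison in the question is to reproduce the cross-validation folds by hand. The sketch below is only an illustration under stated assumptions: the toy `docs` and `prices` arrays are hypothetical stand-ins for `df['X']` and `df['price']`, and `KFold(n_splits=2, shuffle=False)` is written out explicitly because that is what `cv=2` resolves to for a regressor. It fits the same pipeline on each training fold, scores it on the held-out fold, and also computes the score with the `r2_score` arguments swapped.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for df['X'] and df['price'].
rng = np.random.RandomState(0)
words = ["small", "large", "old", "new", "city", "rural", "garden", "garage"]
weights = dict(zip(words, [1.0, 5.0, -2.0, 3.0, 4.0, -1.0, 2.0, 1.5]))
docs = [" ".join(rng.choice(words, size=3, replace=False)) for _ in range(200)]
prices = np.array([sum(weights[w] for w in d.split()) for d in docs])
prices = prices + rng.normal(0.0, 0.5, size=len(docs))
X = np.array(docs, dtype=object)

pipe = Pipeline([
    ("Vect", CountVectorizer()),
    ("rf", RandomForestRegressor(random_state=1)),
])

# cv=2 on a regressor means two unshuffled folds; spelled out here.
cv = KFold(n_splits=2, shuffle=False)
cv_scores = cross_val_score(pipe, X, prices, cv=cv)

# Reproduce each fold by hand: the vectorizer and the forest are
# refitted on the training indices only, then scored on the rest.
manual_scores = []
for train_idx, test_idx in cv.split(X):
    pipe.fit(X[train_idx], prices[train_idx])
    pred = pipe.predict(X[test_idx])
    manual_scores.append(r2_score(prices[test_idx], pred))

# The last fold (train on first half, test on second half) is the
# same split as train_test_split(shuffle=False, test_size=0.5).
swapped = r2_score(pred, prices[test_idx])  # arguments in the wrong order

print("cross_val_score:", cv_scores)
print("manual folds:   ", manual_scores)
print("swapped args:   ", swapped)
```

In this sketch the manual fold scores match `cross_val_score`, which is consistent with the pipeline fitting `CountVectorizer()` on the training fold only. `r2_score` is not symmetric in its arguments, so `r2_score(mypred, y_test)` and `r2_score(y_test, mypred)` generally give different numbers. Note also that `cross_val_score(..., cv=2)` returns one score per fold, and only the second of them corresponds to the unshuffled 50/50 split above.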
