Reputation: 71
Why does r2_score differ so much between a train_test_split evaluation and cross_val_score on the same pipeline? I suspect the model can see otherwise-unknown words through CountVectorizer() in the pipeline. But as I understand Pipeline, CountVectorizer() should only be fitted on the training portion of each split made by cross_val_score, right?
pipe = Pipeline([('Vect', CountVectorizer()), ('rf', RandomForestRegressor(random_state=1))])
X_train, X_test, y_train, y_test = train_test_split(df['X'], df['price'], shuffle=False, test_size=0.5)
reg = pipe.fit(X_train, y_train)
mypred = reg.predict(X_test)
r2_score(mypred, y_test)
# result is -0.2
cross_val_score(pipe, df['X'], df['price'], cv=2)
# result is about 0.3
Upvotes: 0
Views: 184
Reputation: 36617
r2_score(mypred, y_test)
is wrong.
r2_score expects the true values as the first argument and the predicted values as the second. The metric is not symmetric, so swapping the two changes the result. Correct that to:
r2_score(y_test, mypred)
and then compare the results again.
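To see why the order matters, here is a minimal sketch with a hand-written R² function. It is an illustration of the formula, not scikit-learn's own code, and the helper name `r2` and the numbers are made up. R² is 1 - SS_res / SS_tot, where SS_tot is measured around the mean of whatever is passed as the true values. Predictions from a weak model tend to vary much less than the real targets, so passing them as the first argument shrinks SS_tot and can push the score far below zero.

```python
def r2(y_true, y_pred):
    """Hand-written R^2: 1 - SS_res / SS_tot, with SS_tot around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values: the predictions barely move, the true targets vary a lot.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.4, 2.5, 2.6, 2.7]

correct = r2(y_true, y_pred)  # true values first: roughly 0.19
swapped = r2(y_pred, y_true)  # arguments swapped: roughly -80

print(f"correct order: {correct:.3f}")
print(f"swapped order: {swapped:.3f}")
```

With no scoring argument, cross_val_score calls the regressor's own score method, which computes R² with the true values in the right place. That is why only the manual r2_score call was affected. Also, since shuffle=False and cv=2 both cut the data at the midpoint without shuffling, the corrected hold-out score should roughly match the second of the two cross-validation fold scores.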
Upvotes: 2