Anna Gromovich

Reputation: 17

Cross validation and/or train_test_split in scikit-learn?

Could you please explain whether I still need to do a train_test_split if I use cross-validation? If I do, should I run cross-validation only on the train set? What is the best practice regarding cross-validation and train_test_split?


from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_validate

numeric_features = X.select_dtypes(include=['int', 'float']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Note: ElasticNet does not accept n_jobs; parallelism is passed to cross_validate instead
en = ElasticNet(alpha=0.1, l1_ratio=0.3)

model = make_pipeline(preprocessor, en)

cv_results = cross_validate(model, X, y, cv=10, scoring='neg_mean_absolute_error', n_jobs=-1, return_estimator=True)
scores = -cv_results['test_score']  # sklearn reports the negated MAE, so flip the sign

print(f'The mean MAE of the model is {scores.mean():.2f} +/- {scores.std():.2f}')

Here I used cross_validate on the whole dataset and didn't test the model on unseen data from a train_test_split. Can the result of scores.mean() be considered reliable in this case? Or would the best practice be to use train_test_split first, run cross-validation only on the train set, and then check the model on a held-out test set? A minimal sketch of that second workflow is below.
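For concreteness, this is roughly what that alternative would look like, reusing the model pipeline defined above (the test_size and random_state values here are arbitrary placeholders, not a recommendation):

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hold out a test set before any model selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validate on the training portion only
cv_results = cross_validate(model, X_train, y_train, cv=10, scoring='neg_mean_absolute_error', n_jobs=-1)
cv_scores = -cv_results['test_score']
print(f'CV MAE: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}')

# Refit on the full training set and evaluate once on the held-out test set
model.fit(X_train, y_train)
print(f'Test MAE: {mean_absolute_error(y_test, model.predict(X_test)):.2f}')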

Upvotes: 0

Views: 59

Answers (0)
