Reputation: 17
Could you please explain whether I still need to do a train/test split if I use cross-validation? If I do, should I use cross-validation only on the training set? What is the best practice regarding cross-validation and train_test_split?
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_validate
numeric_features = X.select_dtypes(include=['int', 'float']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)
en = ElasticNet(alpha=0.1, l1_ratio=0.3)  # ElasticNet has no n_jobs parameter
model = make_pipeline(preprocessor, en)
cv_results = cross_validate(model, X, y, cv=10, scoring='neg_mean_absolute_error', return_estimator=True, n_jobs=-1)
scores = -cv_results['test_score']
print(f'The mean MAE of the model is {scores.mean():.2f} +/- {scores.std():.2f}')
Here I used cross_validate on the whole dataset and didn't test the model on unseen data from a train_test_split. Can the result of scores.mean() be considered reliable in this case? Or would the best practice be to use train_test_split first, do cross-validation only on the training set, and then check the model on the test set?
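If the split-first workflow is indeed the better practice, I assume it would look roughly like the sketch below (reusing the X, y, and model pipeline from above; the test_size and random_state values are just placeholders):

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_absolute_error

# Hold out a test set before any cross-validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validate on the training portion only
cv_results = cross_validate(model, X_train, y_train, cv=10, scoring='neg_mean_absolute_error', n_jobs=-1)
cv_mae = -cv_results['test_score']
print(f'CV MAE: {cv_mae.mean():.2f} +/- {cv_mae.std():.2f}')

# Fit on the full training set and evaluate once on the held-out test set
model.fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f'Test MAE: {test_mae:.2f}')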
Upvotes: 0
Views: 59