Reputation: 43
I'm working on the Titanic dataset as my first Kaggle project, but I get this UserWarning and I'm trying to find out how to get rid of it.
So I made two preprocessing sub-pipelines:
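from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())])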
My selected features for building the model are:
numeric_features = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']
My preprocessing pipeline is:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numeric_features),
        ('cat', cat_pipeline, categorical_features),
    ])
In my final pipeline I add a classifier:
clf = Pipeline([
    ('Preprocessor', preprocessor),
    ('Classifier', DecisionTreeClassifier())])
Then I call cross_val_score to evaluate the model, and this is when I get the UserWarning:
cross_val_score(clf, X_train, y_train, cv=3, scoring="accuracy")
ValueError: Found unknown categories ['Missing'] in column 2 during transform UserWarning,
array([ nan, 0.70403587, 0.74774775])
My guess is that cross_val_score fits on training folds WITHOUT the 'Missing' category and then tests on a fold WITH the 'Missing' category, hence the error. So I tried dropping the rows with missing values in 'Embarked', but I still get the error, which is weird.
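The mechanism guessed at above can be reproduced in isolation. This is only a minimal sketch: `train_fold` and `valid_fold` are hypothetical toy arrays standing in for one training fold and one validation fold of 'Embarked', not the real Titanic data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins (assumption): the training fold has no 'missing' value,
# the validation fold does.
train_fold = np.array([["S"], ["C"], ["Q"]], dtype=object)
valid_fold = np.array([["S"], ["missing"]], dtype=object)

# By default OneHotEncoder uses handle_unknown='error'.
encoder = OneHotEncoder()
encoder.fit(train_fold)

try:
    encoder.transform(valid_fold)
    error_message = None
except ValueError as exc:
    # The encoder refuses categories it did not see during fit.
    error_message = str(exc)

print(error_message)
```

If the same thing happens inside cross_val_score, the failing fold is presumably reported as nan, which would match the first entry of the score array above.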
Upvotes: 2
Views: 845
Reputation: 185
I just ran into this myself, and then I realized that it makes no sense to run OneHotEncoder separately for each fold of the cross-validation.
The docs for ColumnTransformer say:
Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:
- Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
- You may want to include the parameters of the preprocessors in a parameter search.
So you are right to put SimpleImputer and StandardScaler in the ColumnTransformer, which goes directly into the Pipeline fed to cross_val_score: they will be refitted independently for each fold, so no information from the validation set is used.
However, OneHotEncoder should be run in a separate step, before feeding the data to cross_val_score, i.e. X_train should already contain one-hot variables rather than categorical ones.
This way you not only avoid this warning (caused, as you already mention, by the validation set containing a category that is absent from the fold's training set) but also keep the one-hot variables consistent, as they should be. In other words, you don't risk the one-hot variables being ordered or named differently in each fold.
Upvotes: 4
Reputation: 795
I hope you did the following:
Upvotes: 0