MedCh

Reputation: 43

ValueError: Found unknown categories while calling cross_val_score

I'm working on the Titanic dataset as my first Kaggle project, but I get this UserWarning and I'm trying to find out how to get rid of it.

So I made two Preprocessing sub-pipelines:

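from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Numeric columns: fill gaps with the median, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Categorical columns: fill gaps with the literal 'missing', then one-hot encode
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())])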

My selected features for building the model are:

numeric_features = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']

My preprocessing pipeline is:

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numeric_features),
        ('cat', cat_pipeline, categorical_features)
    ])

In my final pipeline I add a classifier:

clf = Pipeline([
    ('Preprocessor', preprocessor),
    ('Classifier', DecisionTreeClassifier())])

Then I call cross_val_score to evaluate the model, and this is when I get the UserWarning:

cross_val_score(clf, X_train, y_train, cv=3, scoring="accuracy")

ValueError: Found unknown categories ['Missing'] in column 2 during transform UserWarning,

array([ nan, 0.70403587, 0.74774775])

My guess is that cross_val_score fits the first fold on training data WITHOUT the 'Missing' category and then tests it on data WITH the 'Missing' category, hence the error. So I tried dropping the rows with missing values in 'Embarked', but I still get the error, which is weird.
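The guess above can be checked in isolation. The following is only a minimal sketch on made-up values (the arrays `train_fold` and `valid_fold` are hypothetical stand-ins for one cross-validation split, not the Titanic data): an encoder fitted without a category raises a ValueError when it later sees that category, and the `handle_unknown` parameter of OneHotEncoder controls that behaviour.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column folds: 'missing' appears only in the validation part.
train_fold = np.array([["S"], ["C"], ["Q"]], dtype=object)
valid_fold = np.array([["S"], ["missing"]], dtype=object)

# Default behaviour (handle_unknown='error'): unseen categories raise at transform time.
strict = OneHotEncoder()
strict.fit(train_fold)
try:
    strict.transform(valid_fold)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# With handle_unknown='ignore', an unseen category is encoded as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(train_fold)
encoded = lenient.transform(valid_fold).toarray()
print(encoded)
```

Whether silently encoding an unseen category as zeros is acceptable depends on the model, so treat the second half as one option rather than a recommendation.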

Upvotes: 2

Views: 845

Answers (2)

nareto

Reputation: 185

I just ran into this myself, and then I realized that it makes no sense to run OneHotEncoder separately for each fold of the cross-validation.

The docs for ColumnTransformer say:

Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:

  1. Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
  2. You may want to include the parameters of the preprocessors in a parameter search.

So it is correct to put SimpleImputer and StandardScaler in the ColumnTransformer that goes directly into the Pipeline fed to cross_val_score: they will be refitted independently for each fold, which avoids using information from the validation set.

However, OneHotEncoder should be run in a separate step, before feeding the data to cross_val_score, i.e. X_train should already contain one-hot variables rather than categorical ones.

This way, not only will you avoid this warning (due, as you already mention, to the validation set containing a category that is not present in the fold's training set), but you will also keep the one-hot variables consistent, as they should be. In other words, you don't risk the one-hot variables being ordered or named differently in each fold.

Upvotes: 4

Khalid Saifullah

Reputation: 795

I hope you did the following:

  1. Clean the dataset, i.e. fill all NA/missing values in every column
  2. Filter out and drop NaN rows, i.e. rows in which all values are NaN
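In pandas, those two steps might look roughly like the sketch below. The tiny DataFrame and the fill choices (median for 'Age', the literal 'missing' for 'Embarked') are assumptions for illustration only. The all-NaN rows are dropped first here, because once every gap is filled no row is entirely NaN any more.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; row 2 is entirely NaN.
df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Embarked": ["S", "Q", np.nan, np.nan],
})

# Drop rows in which every value is NaN.
cleaned = df.dropna(how="all")

# Fill the remaining gaps column by column.
cleaned = cleaned.assign(
    Age=cleaned["Age"].fillna(cleaned["Age"].median()),
    Embarked=cleaned["Embarked"].fillna("missing"),
)
print(cleaned)
```

Note that filling with a statistic such as the median before cross-validation uses information from all rows, so the scores may be slightly optimistic.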

Upvotes: 0
