Murchie85

Reputation: 865

How to preprocess test data after one hot encoding

I am a bit confused here. I have one-hot encoded the categorical columns that have fewer than 10 unique values (low_cardinality_cols) and dropped the remaining categorical columns, for both the training and validation data.

Now I aim to apply my model to new data in a test.csv. What would be the best method for pre-processing the test data to match the train/validation format?

My concerns are:
1. Test_data.csv will certainly have different cardinality for those columns.
2. If I one-hot encode the test data using the low-cardinality columns from training, I get "Input contains NaN", even though my train, valid & test data all have the same number of columns.

A sample of my one-hot encoding is below; this is for the Kaggle competition/intermediate course.

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
# This also saves us the hassle of dropping columns 

num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

Upvotes: 2

Views: 2301

Answers (2)

Parthasarathy Subburaj

Reputation: 4264

As far as I can tell, there are two possible solutions to this. I will illustrate both here and you can pick whichever works for you.

Solution 1

If you can get all the possible levels/values of the categorical variable that you plan to encode, you can pass them as the categories parameter when you perform one-hot encoding. The default value for categories is 'auto', which determines the categories automatically from the training data and does not account for new categories found in the testing data. Passing categories as a list of all possible categories solves this problem: even if your testing data has categories that were not present in the training/validation data, they will all be encoded correctly and you won't be getting NaNs.

Solution 2

If you are not able to collect all possible categories of a categorical column, you can fit the one-hot encoder the way you have done. Then, to handle the NaNs you encounter when transforming your test data, you can use an imputation technique such as SimpleImputer or IterativeImputer to impute the missing values before processing further.
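A rough sketch of the imputation route, under the assumption that the NaNs are missing values in the test columns themselves. It uses SimpleImputer with strategy="most_frequent", since that strategy works on string data; the "color" column and its values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the test column has a missing value and an unseen level.
X_train = pd.DataFrame({"color": ["red", "blue", "red"]}, dtype=object)
X_test = pd.DataFrame({"color": ["blue", np.nan, "green"]}, dtype=object)

# Learn the most frequent level on the training data only, then reuse it.
imputer = SimpleImputer(strategy="most_frequent")
train_filled = imputer.fit_transform(X_train)
test_filled = imputer.transform(X_test)  # NaN is replaced by "red"

# handle_unknown="ignore" encodes the unseen "green" as a row of zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_filled)
encoded_test = encoder.transform(test_filled).toarray()
print(encoded_test)
```

Fitting the imputer on the training data and only calling transform on the test data keeps the test set from influencing the fill value.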

Upvotes: 1

glemaitre

Reputation: 1003

I would advise 2 things:

  • OneHotEncoder has a parameter handle_unknown, which is "error" by default. It should be set to handle_unknown="ignore" in the case that you mention (categories in testing not known during training).
  • Use a scikit-learn pipeline that includes your predictor, instead of calling fit_transform and transform yourself and then passing the data to the predictor.
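Both points can be combined in one sketch. The toy data, the column names, and the choice of RandomForestRegressor are assumptions for illustration; it also assumes every column outside low_cardinality_cols is numeric, so any high-cardinality categorical columns would need to be dropped first:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and one numeric feature.
X_train = pd.DataFrame(
    {"color": ["red", "blue", "red", "blue"], "size": [1.0, 2.0, 3.0, 4.0]}
)
y_train = [10.0, 20.0, 30.0, 40.0]
X_test = pd.DataFrame({"color": ["green", "blue"], "size": [2.5, 1.5]})

low_cardinality_cols = ["color"]

# One-hot encode the categorical columns; pass the other columns through unchanged.
preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), low_cardinality_cols),
    ],
    remainder="passthrough",
)

# The pipeline fits the encoder and the predictor together on the training data.
model = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("regressor", RandomForestRegressor(n_estimators=10, random_state=0)),
    ]
)
model.fit(X_train, y_train)

# predict() applies the already-fitted encoder to the test data automatically.
predictions = model.predict(X_test)
print(predictions)
```

With this layout the test data never needs a separate encoding step, so its columns always match what the predictor was trained on.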

Upvotes: 4
