One-hot encoding with categorical dataset: how to deal with different (fewer) values in categorical data

Question

Training dataset total categorical columns: 27

Test dataset total categorical columns: 27

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[test_low_cardinality_cols]))

After Encoding, while preparing Test data for prediction,

number of columns from test data: 115

number of columns from train data: 122

I checked the cardinality of the test data: for a few columns it is lower than in the corresponding training data columns.

Train_data.column#1: 2
Test_data.column#1: 1

Train_data.column#2: 5
Test_data.column#2: 3
and more...

So one-hot encoding automatically produces fewer columns for the test data. Is there a better way to prepare the test dataset without any data loss?

Venkatachalam · Accepted Answer

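The ideal procedure is to fit the OneHotEncoder on the training data and then call transform on the test data. That way, you get the same number of columns for train and test: a category that appears only in the training data still gets its own column in the encoded test data, filled with zeros.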

Something like the following:

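import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_encoder.fit(X_train)  # learn the categories from the training data only

OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test))  # same columns as train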
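To make the effect concrete, here is a minimal, self-contained sketch on toy data. The column names and values are made up for illustration and are not from the question. It leaves the encoder's output sparse and calls toarray(), which sidesteps the differently named dense-output parameter across scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): the test set lacks some categories seen in training.
X_train = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red"],
    "size": ["S", "M", "S", "M", "S"],
})
X_test = pd.DataFrame({
    "color": ["red", "red", "green"],
    "size": ["S", "S", "S"],
})

# Fit on the training data only, so the encoder learns every training category.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# transform() returns a sparse matrix by default; toarray() makes it dense.
train_encoded = pd.DataFrame(encoder.transform(X_train).toarray(), index=X_train.index)
test_encoded = pd.DataFrame(encoder.transform(X_test).toarray(), index=X_test.index)

# For contrast: fitting a second encoder on the test data drops columns.
refit_on_test = OneHotEncoder(handle_unknown="ignore").fit_transform(X_test).toarray()

print(train_encoded.shape[1], test_encoded.shape[1], refit_on_test.shape[1])
# prints: 5 5 3
```

The categories missing from the test set ("blue" and "M" here) show up as all-zero columns, so no information is lost and the model sees the layout it was trained on.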

To see the column names of the OneHotEncoder output, use the get_feature_names method (called get_feature_names_out in scikit-learn 1.0 and later). This answer might help.
