Reputation: 2625
Training dataset total categorical columns: 27
Test dataset total categorical columns: 27
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[test_low_cardinality_cols]))
After encoding, while preparing the test data for prediction:
number of columns from test data: 115
number of columns from train data: 122
I checked the cardinality in the test data. For a few columns it is lower than in the corresponding train data columns:
Train_data.column#1: 2, Test_data.column#1: 1
Train_data.column#2: 5, Test_data.column#2: 3
and more.
So one-hot encoding the test data on its own automatically produces fewer columns. Is there a better way to prepare the test dataset without any data loss?
Upvotes: 2
Views: 1171
Reputation: 16966
The ideal procedure is to fit the OneHotEncoder on the training data and then call transform on the test data. This way, you get the same number of columns in the train and test data, because the encoder reuses the categories it learned during fit.
Something like the following:
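import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Fit on the training data only, then reuse the fitted encoder on the test data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_encoder.fit(X_train)
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test))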
To see which output column corresponds to which category, use the encoder's get_feature_names method (named get_feature_names_out in newer scikit-learn versions). This answer might also help.
Upvotes: 1