Do I always have to keep a copy of training data if I do one hot encoding?

Question

I am one-hot encoding my categorical data. When I'm testing, I do something like this:

data.append(train_data_X)
data.append(test_data_X)
one_hot_encode(data)
model.test(data[-test_data_X.shape[0]:])  # test rows were appended last

Is there a way to encode and evaluate my test data without having access to my training data?

MaximeKan · Accepted Answer

The usual best practice is to use scikit-learn's OneHotEncoder class, precisely to avoid the issue you are having.

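from sklearn.preprocessing import OneHotEncoder
# Categories unseen during fit are encoded as all zeros instead of raising an error
encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
# Learn the categories from the training set only
X_train_encoded = encoder.fit_transform(X_train)
# Reuse the learned categories for the test set
X_test_encoded = encoder.transform(X_test)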

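This ensures the same one-hot encoding is applied to the test set: the encoder learns the categories from X_train during fit, and transform reuses them without touching the training data again. You can then use X_train_encoded to train your model and X_test_encoded to evaluate it.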
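
If the test run happens in a separate process where the training data is no longer available, one option is to save the fitted encoder and load it back later. Here is a minimal sketch of that idea. The toy color column and the use of pickle are my own assumptions for illustration; they are not part of the original question.

```python
import pickle

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training data (made up for illustration): a single categorical column.
X_train = np.array([["red"], ["green"], ["blue"]], dtype=object)

# Fit once on the training data; the learned categories are stored in the encoder.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# Serialize the fitted encoder (kept in memory here; a file works the same way).
saved = pickle.dumps(encoder)

# Later, at test time: only the saved encoder is needed, not X_train.
loaded = pickle.loads(saved)
X_test = np.array([["green"], ["purple"]], dtype=object)  # "purple" was never seen
X_test_encoded = loaded.transform(X_test).toarray()

# Columns follow the sorted categories: blue, green, red.
# "green" becomes [0, 1, 0]; the unseen "purple" becomes all zeros.
print(X_test_encoded)
```

In practice you would probably write the encoder to disk with pickle.dump (or joblib.dump) next to the trained model, and load both with the same scikit-learn version that produced them.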
