rowana

Reputation: 748

Do I always have to keep a copy of training data if I do one hot encoding?

I am one-hot encoding my categorical data. When I'm testing, I do something like this:

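# training rows go in first, then the test rows
data.append(train_data_X)
data.append(test_data_X)
one_hot_encode(data)
# the test rows are the last test_data_X.shape[0] rows
model.test(data[-test_data_X.shape[0]:])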

I was wondering whether there is a way to encode and evaluate my test data without having access to my training data.

Upvotes: 0

Views: 63

Answers (2)

MaximeKan

Reputation: 4221

The usual best practice is to use scikit-learn's OneHotEncoder class, precisely to avoid the issue you are having.

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
# learn the categories from the training data only
X_train_encoded = encoder.fit_transform(X_train)
# reuse the learned categories on the test data
X_test_encoded = encoder.transform(X_test)

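The encoder learns the categories from the training set when it is fitted, and transform applies exactly the same encoding to the test set. With handle_unknown='ignore', a category that appears only in the test set is encoded as all zeros for that feature instead of raising an error. So you can use X_train_encoded to train your model, and then X_test_encoded to evaluate it.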
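
To test without the training data at hand, one option is to save the fitted encoder and load it again later. The following is only a minimal sketch of that idea: the toy arrays and the names saved_encoder and restored are made up for illustration, and pickle is just one way to persist the object (it is an assumption here, not something the answer above prescribes).

```python
import pickle

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with a single categorical column.
X_train = np.array([["red"], ["green"], ["red"]])
X_test = np.array([["green"], ["blue"]])  # "blue" never appears in training

# Training time: fit the encoder once, then serialize it.
encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
encoder.fit(X_train)
saved_encoder = pickle.dumps(encoder)  # could be written to a file instead

# Test time: restore the fitted encoder; X_train is not needed any more.
restored = pickle.loads(saved_encoder)
X_test_encoded = restored.transform(X_test).toarray()

print(restored.categories_)
print(X_test_encoded)
```

The learned categories live in the encoder's categories_ attribute, so the restored object carries everything it needs to encode new rows consistently. A pickled encoder should generally be loaded with the same scikit-learn version that created it.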

Upvotes: 2

srikar vaka

Reputation: 21

Below is the straightforward approach, but it may not always work (the reason is explained after the code).

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()

# fit the encoder on the training data
enc.fit(X_train)
# transform the training data
X_train_encoded = enc.transform(X_train)
# transform the test data
X_test_encoded = enc.transform(X_test)

But there is a small problem with this approach. If a column in your training data has 2 unique values, the encoder creates 2 dummy features for it. If the same column in your test data contains a third value that was never seen during fitting, the encoder (with its default handle_unknown='error') raises an error in transform rather than producing an extra column. One way around this is to combine the test and training data before one-hot encoding and then split the result back into test and training sets by index; note that this requires having both sets available when you encode.
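
To see what happens when the test set contains a category that was absent from training, here is a minimal sketch using made-up toy arrays (the values "a", "b", "c" and the variable names raised and n_features are purely illustrative). It assumes the encoder's default setting, handle_unknown="error".

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the training column has 2 unique values,
# the test column contains a third one ("c").
X_train = np.array([["a"], ["b"], ["a"]])
X_test = np.array([["a"], ["c"]])

enc = OneHotEncoder()  # default: handle_unknown="error"
enc.fit(X_train)

try:
    enc.transform(X_test)
    raised = False
except ValueError as err:
    # The unseen category is rejected at transform time.
    raised = True
    print("transform failed:", err)

# The number of dummy features is fixed by the training data.
n_features = len(enc.categories_[0])
print(n_features)
```

The failure comes from the encoder itself, before the model ever sees the data, and the number of output columns stays fixed by what was seen during fit.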

Upvotes: 0
