Reputation: 748
I am one-hot encoding my categorical data. When I'm testing, I do something like this:
data.append(train_data_X)
data.append(test_data_X)
one_hot_encode(data)
model.test(data[-test_data_X.shape[0]:])
Is there a way to encode and test my test data without having access to my training data?
Upvotes: 0
Views: 63
Reputation: 4221
The usual best practice is to use scikit-learn's OneHotEncoder class, precisely to avoid the issue you are having.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)
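The encoder is fitted on the training set only, and the same learned mapping is then applied to the test set, so both encoded matrices have the same columns. With handle_unknown='ignore', a category that appears only in the test set is encoded as all zeros for that feature instead of raising an error. You can then train your model on X_train_encoded and evaluate it on X_test_encoded.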
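Since the fitted encoder holds everything needed to encode new rows, it can be saved once and reloaded at test time without the training data. Here is a minimal sketch of that idea, assuming the standard-library pickle module is an acceptable way to store it; the toy arrays and the variable names saved and restored are made up for illustration.

```python
import pickle

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical column.
X_train = np.array([["red"], ["green"], ["red"]])
X_test = np.array([["green"], ["red"]])

# Training time: fit on the training data only, then serialise the encoder.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)
saved = pickle.dumps(encoder)  # in practice these bytes would go to a file

# Test time: restore the encoder. X_train is not needed here.
restored = pickle.loads(saved)
X_test_encoded = restored.transform(X_test).toarray()

# Columns follow the sorted training categories: ['green', 'red'].
print(restored.categories_)
print(X_test_encoded)
```

A pickled encoder should generally be loaded with the same scikit-learn version that created it.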
Upvotes: 2
Reputation: 21
Below is the straightforward approach, but it does not always work (the reason is explained after the code).
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# Fit the encoder on the training data
enc.fit(X_train)
# Transform the training data
X_train_encoded = enc.transform(X_train)
# Transform the test data
X_test_encoded = enc.transform(X_test)
But there is a small problem with this approach. If a column in your training data has 2 unique values, the encoder creates 2 dummy features for it. If the same column in your test data contains a third value that the encoder never saw during fitting, the default setting (handle_unknown='error') makes transform raise a ValueError. One workaround is to combine the test and train data before one-hot encoding and then split them back into test and train by index afterwards.
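The unseen-category case is easy to reproduce on toy data. The sketch below uses made-up arrays and variable names (strict, lenient) to show the default encoder failing on a value it was not fitted on. For comparison, it also shows handle_unknown='ignore', which keeps the column count fixed and is an alternative to combining the two sets.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: training has two categories, test has an unseen one.
X_train = np.array([["a"], ["b"], ["a"]])
X_test = np.array([["a"], ["c"]])  # "c" never appears in X_train

# Default behaviour (handle_unknown="error"): transform fails on "c".
strict = OneHotEncoder()
strict.fit(X_train)
try:
    strict.transform(X_test)
    raised = False
except ValueError as exc:
    raised = True
    print("transform failed:", exc)

# With handle_unknown="ignore", the unseen value becomes an all-zero row
# and the number of columns stays the same as for the training data.
lenient = OneHotEncoder(handle_unknown="ignore").fit(X_train)
encoded = lenient.transform(X_test).toarray()
print(encoded)
```

Which option fits depends on whether an all-zero row is an acceptable encoding for unknown values in your model.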
Upvotes: 0