Is it bad to have different Label Encoder values on the training and test sets?

In my data set, I have a categorical feature called product.

Let's say that in the training set its values are in {"apple", "banana", "durian", "orange", ...}. In the test set, on the other hand, the values can be {"banana", "orange", "pineapple"}. Some of these values do not appear in the training set (e.g., pineapple).

I know that if we have all possible values in advance, we can create a Label Encoder and fit it on every value the feature can take. But in this case, I cannot guarantee that the training set covers all the values in the test set (i.e., when new products appear).

This worries me because with Label Encoding the training set might be mapped as {"apple": 1, "banana": 2, "durian": 3, "orange": 4, ... (thousands more)}, but when the test set is mapped, we would get {"banana": 1, "orange": 2, "pineapple": 3}.
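To make the concern concrete, here is a minimal sketch in plain Python. It assumes an encoder that assigns codes by the sorted order of the values it was fitted on, which is how scikit-learn's LabelEncoder behaves (its codes start at 0 rather than 1). The helper `fit_codes` is hypothetical and only stands in for fitting such an encoder.

```python
def fit_codes(values):
    """Hypothetical stand-in for fitting a label encoder:
    assign integer codes by sorted order of the distinct values."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

train_products = ["apple", "banana", "durian", "orange"]
test_products = ["banana", "orange", "pineapple"]

train_codes = fit_codes(train_products)
test_codes = fit_codes(test_products)  # the mistake: refitting on the test set

print(train_codes)  # {'apple': 0, 'banana': 1, 'durian': 2, 'orange': 3}
print(test_codes)   # {'banana': 0, 'orange': 1, 'pineapple': 2}

# Invert both mappings and list every code whose meaning changed.
train_by_code = {code: value for value, code in train_codes.items()}
test_by_code = {code: value for value, code in test_codes.items()}
clashes = {
    code: (train_by_code.get(code), test_by_code[code])
    for code in test_by_code
    if train_by_code.get(code) != test_by_code[code]
}
print(clashes)
# {0: ('apple', 'banana'), 1: ('banana', 'orange'), 2: ('durian', 'pineapple')}
```

Every code ends up meaning a different product in the two sets, so a model trained on the first mapping would misread all of the test rows.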

My questions are:

1. Does this have a negative impact on a classification model? For example, if apple turns out to be an important value of the product feature, then as far as I know the model will give more weight to 1 (the numeric code for apple). Is it misleading if 1 means banana in the test set?
2. Is there a way to deal with this kind of label encoding problem, where the training and test sets contain different values?

I found some relevant links like this one, but it's not exactly my problem.

Update: Please note that the product feature can have thousands of values, which is why I use Label Encoder here rather than One Hot Encoding.

Upvotes: 3

Answers (2)

Venkatachalam

Reputation: 16966

You have to use one-hot encoding when feeding categorical variables into ML models. Otherwise the model will treat the codes as ordered, i.e. apple < banana < durian < orange, which is not actually the case.

For unknown values that show up in the test dataset, all the columns for that variable will be zero, which tells the model that this value was not seen during training.

from sklearn.preprocessing import OneHotEncoder

X = [["apple"], ["banana"], ["durian"], ["orange"]]
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X)

enc.categories_

categories:

[array(['apple', 'banana', 'durian', 'orange'], dtype=object)]

On the test data:

enc.transform([["banana"], ["orange"], ["pineapple"]]).toarray()

output:

array([[0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.]])
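Since the question mentions thousands of product values, it may help to note that the encoder returns a SciPy sparse matrix by default, so the all-zero and mostly-zero rows are not stored densely. The sketch below repeats the example above without calling toarray() first; the variable names are only illustrative, and it assumes the default sparse output has not been switched off.

```python
from sklearn.preprocessing import OneHotEncoder

train = [["apple"], ["banana"], ["durian"], ["orange"]]
enc = OneHotEncoder(handle_unknown="ignore").fit(train)

# The default output is a scipy.sparse matrix: only nonzero entries are stored.
encoded = enc.transform([["banana"], ["orange"], ["pineapple"]])

print(encoded.shape)  # (3, 4): one column per category seen during fit
print(encoded.nnz)    # number of stored nonzero entries

# The unseen value "pineapple" becomes a row with no active column.
unknown_row = encoded.toarray()[2]
print(unknown_row)
```

Many scikit-learn estimators accept sparse input directly, so the matrix often does not need to be densified even with a large number of categories.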

Upvotes: 4

C M Khaled Saifullah

Reputation: 164

If I were in your position, I would build a dictionary from the training data and use the same dictionary on the test data. There may be cases where the test data contains a value/word that the training data never encountered. For those cases I would use a special index, called the unknown token. My dictionary would therefore be: {"UNK": 0, "apple": 1, "banana": 2, "durian": 3, "orange": 4}

Then for the test data {"banana", "orange", "pineapple"}, I would get {2, 4, 0}.
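The dictionary approach described above could be sketched roughly as follows. The helper names `build_vocab` and `encode` are hypothetical, and sorting the training values is just one way to make the mapping reproducible.

```python
UNK = "UNK"  # reserved token for values not seen during training

def build_vocab(train_values):
    """Build the mapping from the training data only; index 0 is kept for UNK."""
    vocab = {UNK: 0}
    for value in sorted(set(train_values)):
        vocab[value] = len(vocab)
    return vocab

def encode(values, vocab):
    """Map each value to its index, falling back to the UNK index."""
    return [vocab.get(value, vocab[UNK]) for value in values]

vocab = build_vocab(["apple", "banana", "durian", "orange"])
test_encoded = encode(["banana", "orange", "pineapple"], vocab)

print(vocab)         # {'UNK': 0, 'apple': 1, 'banana': 2, 'durian': 3, 'orange': 4}
print(test_encoded)  # [2, 4, 0]
```

One caveat with this sketch: the model never sees index 0 during training unless some training values are also mapped to UNK (for example, very rare products), so how it treats unknown products at test time is not something it has learned.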

I hope that will be useful.

Upvotes: 2
