Reputation: 419
The docs for sklearn.preprocessing.LabelEncoder start with
This transformer should be used to encode target values, i.e. y, and not the input X.
Why is this?
Here is just one example of this recommendation being ignored in practice, although there seem to be plenty more. https://www.kaggle.com/matleonard/feature-generation contains
# (ks is the input data)
# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)
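For reference, here is a self-contained version of that pattern (the toy ks DataFrame below is my own stand-in; the course's actual data is not shown). Note that .apply refits the single encoder once per column, so only the last column's mapping survives in encoder.classes_:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the Kaggle `ks` DataFrame.
ks = pd.DataFrame({'category': ['Music', 'Film', 'Music'],
                   'currency': ['USD', 'EUR', 'USD'],
                   'country':  ['US', 'DE', 'US']})

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# .apply calls fit_transform once per column; each call refits the same
# encoder, so the per-column mappings are lost for inverse_transform.
encoded = ks[cat_features].apply(encoder.fit_transform)
print(encoded)
#    category  currency  country
# 0         1         1        1
# 1         0         0        0
# 2         1         1        1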
Upvotes: 13
Views: 3262
Reputation: 3076
It is not that big of a deal when it changes the output values y, because the model is simply relearned against those relabeled values (in a regression, based on the error).
The problem is when it changes the weighting of the input values X; that can make correct predictions impossible.
You can get away with it on X if there are not many options: for example, 2 categories, 2 currencies, or 2 cities encoded into ints does not change the game too much, as the sketch below shows.
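A quick sketch of that last point (my own illustration, assuming scikit-learn >= 1.2 for the sparse_output parameter): integer-encoding a two-level category carries exactly the same information as a drop-first one-hot column.

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

currency = np.array(['USD', 'EUR', 'USD', 'EUR'])

# Integer encoding of a binary category...
as_int = LabelEncoder().fit_transform(currency)  # [1, 0, 1, 0]

# ...equals the single column of a drop-first one-hot encoding.
as_onehot = OneHotEncoder(drop='first', sparse_output=False).fit_transform(
    currency.reshape(-1, 1))                     # [[1.], [0.], [1.], [0.]]

assert np.array_equal(as_int.astype(float), as_onehot.ravel())

# With 3+ unordered levels the integer codes impose a fake ordering,
# which is where label-encoding X starts to hurt.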
Upvotes: 2
Reputation: 5273
Maybe because the categories can carry a meaningful order. Take, for instance:
Awful, Bad, Average, Good, Excellent
LabelEncoder would give them an arbitrary order (alphabetical, since it sorts the unique labels), which will not help your classifier.
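You can see that arbitrary order directly (a small sketch of my own):

from sklearn.preprocessing import LabelEncoder

quality = ['Awful', 'Bad', 'Average', 'Good', 'Excellent']
le = LabelEncoder()
le.fit_transform(quality)
# array([1, 2, 0, 4, 3])
le.classes_
# array(['Average', 'Awful', 'Bad', 'Excellent', 'Good'], dtype='<U9')
# 'Awful' (1) lands above 'Average' (0): alphabetical, not ordinal.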
In this case you could use either an OrdinalEncoder or a manual replacement.

OrdinalEncoder ("Encode categorical features as an integer array."):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(data=[['Bad', 200], ['Awful', 100], ['Good', 350], ['Average', 300], ['Excellent', 1000]],
                  columns=['Quality', 'Label'])

# Use the 'categories' parameter to specify the desired order.
# Otherwise the order is inferred from the sorted unique values in the data.
enc = OrdinalEncoder(categories=[['Awful', 'Bad', 'Average', 'Good', 'Excellent']])

# Can fit on a single feature or on multiple features at once.
enc.fit_transform(df[['Quality']])
Output:
array([[1.],
       [0.],
       [3.],
       [2.],
       [4.]])
Notice the logical order in the output.
Or with the manual replacement:

scale_mapper = {'Awful': 0, 'Bad': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}
df['Quality'].replace(scale_mapper)
Output:
0    1
1    0
2    3
3    2
4    4
Name: Quality, dtype: int64
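One practical note on this approach (my own addition): Series.replace passes through labels missing from the mapper, while Series.map turns them into NaN, which makes unseen categories easier to spot.

# Continuing from the example above:
s = pd.Series(['Good', 'Bad', 'Superb'])  # 'Superb' is not in scale_mapper
s.replace(scale_mapper)  # -> 3, 1, 'Superb'  (unknown label passed through)
s.map(scale_mapper)      # -> 3.0, 1.0, NaN   (unknown label becomes NaN)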
Upvotes: 3
Reputation: 9865
I think they warn against using it for X (input data) because:

First, categorical input data is in most cases better encoded as one-hot vectors rather than as integers, since you mostly have non-sortable (nominal) categories.

Second, a technical problem is that LabelEncoder is not programmed to handle tables (column-wise/feature-wise encoding would be necessary for X); it assumes the data is just a flat list.
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
categories = list('abcdabaccba')
categories
# ['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a']

categories_numerical = enc.fit_transform(categories)
categories_numerical
# array([0, 1, 2, 3, 0, 1, 0, 2, 2, 1, 0])

# So it turns the categories into numbers
# and can transform them back:
enc.inverse_transform(categories_numerical)
# array(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a'], dtype='<U1')
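And indeed, feeding a whole table to LabelEncoder raises an error, while the encoders intended for X accept 2D input directly. A minimal sketch (the toy DataFrame is mine; sparse_output assumes scikit-learn >= 1.2):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({'currency': ['USD', 'EUR', 'USD'],
                  'country':  ['US', 'DE', 'DE']})

try:
    LabelEncoder().fit_transform(X)  # 2D input -> ValueError
except ValueError as err:
    print(err)                       # y should be a 1d array ...

# The X-oriented encoders handle all columns at once:
OrdinalEncoder().fit_transform(X)
# array([[1., 1.],
#        [0., 0.],
#        [1., 0.]])
OneHotEncoder(sparse_output=False).fit_transform(X)
# array([[0., 1., 0., 1.],
#        [1., 0., 1., 0.],
#        [0., 1., 1., 0.]])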
Upvotes: -1