hlud6646

Reputation: 419

Why shouldn't the sklearn LabelEncoder be used to encode input data?

The docs for sklearn.LabelEncoder start with

This transformer should be used to encode target values, i.e. y, and not the input X.

Why is this?

Here is just one example of this recommendation being ignored in practice, although there seem to be loads more. https://www.kaggle.com/matleonard/feature-generation contains

# (ks is the input data)
from sklearn.preprocessing import LabelEncoder

# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)

Upvotes: 13

Views: 3262

Answers (3)

sogu

Reputation: 3076

Encoding the output value y is not that big of a deal, because the model only learns based on it (for a regression, based on the error).

The problem is when it changes the input values "X": the arbitrary integers distort the learned weights, which makes it impossible to do correct predictions.

You can do it on X if there are not many options; for example, 2 categories, 2 currencies, or 2 cities encoded into ints does not change the game too much. The sketch below illustrates the arbitrary order such an encoding imposes.
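For illustration, a minimal sketch (with made-up currency values) of the arbitrary order LabelEncoder imposes on an input feature:

from sklearn.preprocessing import LabelEncoder

currencies = ['USD', 'EUR', 'JPY', 'EUR', 'USD']
enc = LabelEncoder()
enc.fit_transform(currencies)
# array([2, 0, 1, 0, 2])
# EUR=0, JPY=1, USD=2: an alphabetical, arbitrary order that a linear
# model would treat as meaningful magnitudes (as if USD were "twice" JPY)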

Upvotes: 2

Alaa M.

Reputation: 5273

Maybe because:

  1. It doesn't naturally work on multiple columns at once.
  2. It doesn't support ordering. I.e. if your categories are ordinal, such as:

Awful, Bad, Average, Good, Excellent

LabelEncoder would give them an arbitrary order (alphabetically sorted, in fact), which will not help your classifier.

In this case you could use either an OrdinalEncoder or a manual replacement.

1. OrdinalEncoder:

Encode categorical features as an integer array.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(data=[['Bad', 200], ['Awful', 100], ['Good', 350], ['Average', 300], ['Excellent', 1000]], columns=['Quality', 'Label'])
# Use the 'categories' parameter to specify the desired order.
# Otherwise the order is inferred from the data (sorted alphabetically).
enc = OrdinalEncoder(categories=[['Awful', 'Bad', 'Average', 'Good', 'Excellent']])
enc.fit_transform(df[['Quality']])  # Can fit on either one feature or multiple features at once.

Output:

array([[1.],
       [0.],
       [3.],
       [2.],
       [4.]])

Notice the logical order in the output.

2. Manual replacement:

scale_mapper = {'Awful': 0, 'Bad': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}
df['Quality'].replace(scale_mapper)

Output:

0    1
1    0
2    3
3    2
4    4
Name: Quality, dtype: int64
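
Note: df['Quality'].map(scale_mapper) gives the same result here; the difference is that map returns NaN for categories missing from the mapper, while replace leaves them unchanged.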

Upvotes: 3

Gwang-Jin Kim

Reputation: 9865

I think they warn against using it for X (input data) because:

  • Categorical input data are better encoded as one-hot vectors rather than as integers in most cases, since the categories are mostly non-sortable.

  • Second, there is a technical problem: LabelEncoder is not programmed to handle tables (column-wise/feature-wise encoding would be necessary for X). LabelEncoder assumes that the data is just a flat list. That is the problem:

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

categories = [x for x in 'abcdabaccba']
categories
## ['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a']

categories_numerical = enc.fit_transform(categories)

categories_numerical
# array([0, 1, 2, 3, 0, 1, 0, 2, 2, 1, 0])

# so it turns the categories into numbers
# and can transform them back

enc.inverse_transform(categories_numerical)
# array(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a'], dtype='<U1')
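
To make both points concrete, a minimal sketch (with made-up data; the sparse_output parameter assumes scikit-learn >= 1.2) of how LabelEncoder rejects a 2-D X while OneHotEncoder is built for feature matrices:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

X = pd.DataFrame({'currency': ['USD', 'EUR', 'JPY'],
                  'country':  ['US',  'DE',  'JP']})

# LabelEncoder expects a flat, 1-D y:
# LabelEncoder().fit_transform(X)  # ValueError: y should be a 1d array

# OneHotEncoder handles the whole feature matrix at once
ohe = OneHotEncoder(sparse_output=False)  # named 'sparse' before scikit-learn 1.2
ohe.fit_transform(X)
# array([[0., 0., 1., 0., 0., 1.],
#        [1., 0., 0., 1., 0., 0.],
#        [0., 1., 0., 0., 1., 0.]])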

Upvotes: -1
