wheeeee

Reputation: 1505

What does LabelEncoder().fit() do?

I'm reading some code that has the following lines:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df[1])

Where df[1] is of type pandas.core.series.Series and contains string values such as "basketball", "football", "soccer", etc.

What does the method le.fit() do? I saw that some other fit methods are used to train the model, but that doesn't make sense to me because the input here is purely the labels, not the training data. The documentation simply says "Fit label encoder." What does that mean?

Upvotes: 5

Views: 12977

Answers (2)

chitown88

Reputation: 28564

le.fit() learns a mapping from the distinct values in a categorical column to integers. The conversion itself happens afterwards, when you call le.transform().

Say we have a dataset of people and their favorite sport, and we want to run a machine learning algorithm on that dataframe. Those algorithms work on numbers, so they can't do any computation on the strings 'basketball' or 'football'. What we can do is map each sport to a number that the algorithm can work with:

For example: 'basketball' = 0, 'football' = 1, 'soccer' = 2, etc. We could do that manually by writing a dictionary and applying it to the column, or we can let le.fit() work out the mapping for us.

So we fit it on our training data, and it finds the unique values and assigns an integer to each one, in sorted order:

import pandas as pd
from sklearn import preprocessing


train_df = pd.DataFrame(
    [
        ['Person1', 'basketball'],
        ['Person2', 'football'],
        ['Person3', 'basketball'],
        ['Person4', 'basketball'],
        ['Person5', 'soccer'],
        ['Person6', 'soccer'],
        ['Person7', 'soccer'],
        ['Person8', 'basketball'],
        ['Person9', 'football'],
    ],
    columns=['person', 'sport']
)

le = preprocessing.LabelEncoder()
le.fit(train_df['sport'])

Now we can transform the 'sport' column in our test data using the mapping that le.fit() determined:

test_df = pd.DataFrame(
    [
        ['Person11', 'soccer'],
        ['Person12', 'soccer'],
        ['Person13', 'basketball'],
        ['Person14', 'football'],
        ['Person15', 'football'],
        ['Person16', 'soccer'],
        ['Person17', 'soccer'],
        ['Person18', 'basketball'],
        ['Person19', 'soccer'],
    ],
    columns=['person', 'sport']
)

le.transform(test_df['sport'])

If you want to see what that mapping looks like, add the encoded values to the test set as a column:

test_df['encoded'] = le.transform(test_df['sport'])

Now we can see that it assigned 'basketball' to 0, 'football' to 1, and 'soccer' to 2:

print(test_df)
     person       sport  encoded
0  Person11      soccer        2
1  Person12      soccer        2
2  Person13  basketball        0
3  Person14    football        1
4  Person15    football        1
5  Person16      soccer        2
6  Person17      soccer        2
7  Person18  basketball        0
8  Person19      soccer        2

Upvotes: 5

Miguel Trejo

Reputation: 6667

As @PSK says, the fit() method of LabelEncoder stores the unique values of the array you pass to it. For example, for a numerical array it calls numpy.unique():

import numpy as np
import pandas as pd

d = {'col1': [1, 2, 2, 3], 'col2': ['A', 'B', 'B', 'C']}
df = pd.DataFrame(data=d)

# For a numerical array
np.unique(df.col1)
# array([1, 2, 3])

or, if the array has object dtype, it effectively takes the set of values (which is then sorted):

set(df.col2)
# {'A', 'B', 'C'}  (a set's display order may vary)

and stores this result in the .classes_ attribute of the LabelEncoder, which other methods of the class, such as transform(), later use to encode new data.
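To make the split between fit() and transform() concrete, here is a minimal pure-Python sketch. `TinyLabelEncoder` is a hypothetical stand-in written for this answer, not scikit-learn's actual implementation. It only assumes the behavior described above: fit() stores the sorted distinct labels in classes_, and transform() looks them up.

```python
class TinyLabelEncoder:
    """Hypothetical, simplified stand-in for LabelEncoder (not sklearn's code)."""

    def fit(self, values):
        # Learn the sorted distinct labels; nothing is encoded at this point.
        self.classes_ = sorted(set(values))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, values):
        # Refuse labels that were not present when fit() was called.
        unseen = set(values) - set(self.classes_)
        if unseen:
            raise ValueError(f"unseen labels: {sorted(unseen)}")
        return [self._index[v] for v in values]

    def inverse_transform(self, codes):
        # Map integer codes back to the original labels.
        return [self.classes_[c] for c in codes]


enc = TinyLabelEncoder().fit(["B", "A", "C", "B"])
codes = enc.transform(["C", "A", "B"])

print(enc.classes_)                  # ['A', 'B', 'C']
print(codes)                         # [2, 0, 1]
print(enc.inverse_transform(codes))  # ['C', 'A', 'B']
```

This is why fit() does not need any training features: all it has to "learn" is the list of distinct labels.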

Upvotes: 0
