What does LabelEncoder().fit() do?

Question

I'm reading some code that has the following lines:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df[1])

Where df[1] is of type pandas.core.series.Series and contains string values such as "basketball", "football", "soccer", etc.

What does the method le.fit() do? I saw that some other fit methods are used to train the model, but that doesn't make sense to me because the input here is purely the labels, not the training data. The documentation simply says "Fit label encoder." What does that mean?

chitown88 · Accepted Answer

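It takes a categorical column and learns a mapping from each distinct value to an integer. fit() only works out that mapping; transform() is what actually converts the values.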

Say we have a dataset of people and their favorite sport, and we want to do some machine learning on that dataframe. Machine learning algorithms work on numbers, so we can't do any computation on the strings 'basketball' or 'football'. What we can do is map each sport to a number so the algorithms can do their thing:

For example: 'basketball' = 0, 'football' = 1, 'soccer' = 2, etc. We could do that manually by building a dictionary and applying it to the column, or we can let le.fit() work out the mapping for us.
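To make the "manual dictionary" option concrete, here is a minimal sketch of what it might look like. The `sports` Series and the `mapping` name are made up for illustration; sorting the unique values first is a choice made here so the codes line up with the 0/1/2 example above.

```python
import pandas as pd

# Hypothetical sample column, just for illustration.
sports = pd.Series(['basketball', 'football', 'basketball', 'soccer'])

# Build the mapping by hand: sorted unique values -> consecutive integers.
mapping = {label: code for code, label in enumerate(sorted(sports.unique()))}

# Apply the mapping to the column.
encoded = sports.map(mapping)

print(mapping)
print(encoded.tolist())
```

This works, but you have to keep the dictionary around yourself to encode new data the same way, which is the bookkeeping the fitted encoder does for you.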

So we call it on our training data, and it finds the unique values and assigns an integer to each one:

import pandas as pd
from sklearn import preprocessing


train_df = pd.DataFrame(
    [
        ['Person1', 'basketball'],
        ['Person2', 'football'],
        ['Person3', 'basketball'],
        ['Person4', 'basketball'],
        ['Person5', 'soccer'],
        ['Person6', 'soccer'],
        ['Person7', 'soccer'],
        ['Person8', 'basketball'],
        ['Person9', 'football'],
    ],
    columns=['person', 'sport']
)

le = preprocessing.LabelEncoder()
le.fit(train_df['sport'])
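If you want to check what fit() actually stored, the fitted encoder exposes the learned labels through its classes_ attribute. A small sketch, using a plain list in place of the dataframe column (the `learned` dictionary is just a helper name for illustration):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['basketball', 'football', 'basketball', 'soccer'])

# classes_ holds the sorted unique labels seen during fit;
# each label's position in that array is the integer it maps to.
print(list(le.classes_))

# Rebuild the label -> integer mapping as a plain dict for inspection.
learned = {label: int(code) for code, label in enumerate(le.classes_)}
print(learned)
```

So nothing is "trained" in the model sense: fit() just records the distinct labels. Note that the scikit-learn documentation describes LabelEncoder as intended for target values (y) rather than input feature columns.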

And now we can transform the 'sport' column in our test data using the mapping that le.fit() determined:

test_df = pd.DataFrame(
    [
        ['Person11', 'soccer'],
        ['Person12', 'soccer'],
        ['Person13', 'basketball'],
        ['Person14', 'football'],
        ['Person15', 'football'],
        ['Person16', 'soccer'],
        ['Person17', 'soccer'],
        ['Person18', 'basketball'],
        ['Person19', 'soccer'],
    ],
    columns=['person', 'sport']
)

le.transform(test_df['sport'])

And if you want to see what that mapping looks like, we can add the encoded values to the test set as a column:

test_df['encoded'] = le.transform(test_df['sport'])

And now we see it mapped 'soccer' to 2, 'basketball' to 0, and 'football' to 1:

print(test_df)
     person       sport  encoded
0  Person11      soccer        2
1  Person12      soccer        2
2  Person13  basketball        0
3  Person14    football        1
4  Person15    football        1
5  Person16      soccer        2
6  Person17      soccer        2
7  Person18  basketball        0
8  Person19      soccer        2
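Two follow-ups that tend to come up once the encoder is fitted: going back from integers to labels, and what happens with a label that was not in the data passed to fit(). A short sketch (the 'hockey' label and the `unseen_ok` flag are made up for illustration):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['basketball', 'football', 'soccer'])

# inverse_transform maps integer codes back to the original labels.
decoded = le.inverse_transform([2, 0, 1])
print(list(decoded))

# A label that was not present during fit cannot be encoded:
# transform raises a ValueError for it.
try:
    le.transform(['hockey'])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```

That is why the encoder should be fitted on data that covers every label you expect to see later.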
