Reputation: 1505
I'm reading some code that has the following lines:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df[1])
Where df[1] is of type pandas.core.series.Series and contains string values such as "basketball", "football", "soccer", etc.
What does the method le.fit() do? I saw that some other fit methods are used to train a model, but that doesn't make sense to me here because the input is purely the labels, not the training data. The documentation simply says "Fit label encoder." What does that mean?
Upvotes: 5
Views: 12977
Reputation: 28564
le.fit() takes a categorical column and learns a mapping from its values to integers; le.transform() then applies that mapping to convert the column to numerical values.
Suppose we have a dataset of people and their favorite sports, and we want to run a machine learning algorithm on that dataframe. Those algorithms work on numbers, so we can't do any computation on the strings 'basketball' or 'football'. What we can do is map each sport to a number that the algorithm can work with:
For example: 'basketball' = 0, 'football' = 1, 'soccer' = 2, etc.
We could do that manually by writing a dictionary and applying that mapping to the column, or we can let le.fit() work out the mapping for us.
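The manual route mentioned above could look roughly like this. It is only a sketch: the sports Series and the manual_mapping dictionary are made up for illustration and are not part of the original code.

```python
import pandas as pd

# Hypothetical column of labels (illustration only).
sports = pd.Series(['basketball', 'football', 'soccer', 'basketball'])

# Hand-written mapping from each label to an integer.
manual_mapping = {'basketball': 0, 'football': 1, 'soccer': 2}

# Series.map looks each value up in the dictionary.
encoded = sports.map(manual_mapping)
print(encoded.tolist())  # [0, 1, 2, 0]
```

The drawback is that the dictionary has to be kept in sync with the data by hand, which is the bookkeeping the encoder takes over.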
So we call it on our training data, and it finds the unique values and assigns each one an integer (in sorted order, starting from 0):
import pandas as pd
from sklearn import preprocessing
train_df = pd.DataFrame(
[
['Person1', 'basketball'],
['Person2', 'football'],
['Person3', 'basketball'],
['Person4', 'basketball'],
['Person5', 'soccer'],
['Person6', 'soccer'],
['Person7', 'soccer'],
['Person8', 'basketball'],
['Person9', 'football'],
],
columns=['person', 'sport']
)
le = preprocessing.LabelEncoder()
le.fit(train_df['sport'])
And now we can transform the 'sport' column in our test data using the mapping determined by le.fit():
test_df = pd.DataFrame(
[
['Person11', 'soccer'],
['Person12', 'soccer'],
['Person13', 'basketball'],
['Person14', 'football'],
['Person15', 'football'],
['Person16', 'soccer'],
['Person17', 'soccer'],
['Person18', 'basketball'],
['Person19', 'soccer'],
],
columns=['person', 'sport']
)
le.transform(test_df['sport'])
And if you want to see what that mapping looks like, we can add the encoded values to the test set as a column:
test_df['encoded'] = le.transform(test_df['sport'])
And now we can see that it assigned 'soccer' to the value 2, 'basketball' to 0, and 'football' to 1:
print(test_df)
     person       sport  encoded
0  Person11      soccer        2
1  Person12      soccer        2
2  Person13  basketball        0
3  Person14    football        1
4  Person15    football        1
5  Person16      soccer        2
6  Person17      soccer        2
7  Person18  basketball        0
8  Person19      soccer        2
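If you would rather inspect the learned mapping directly than infer it from an encoded column, something like the following sketch should work. It refits a fresh encoder on a small made-up list so that it runs on its own; 'tennis' is a hypothetical label chosen only because it was not seen during fit.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['basketball', 'football', 'basketball', 'soccer'])

# classes_ holds the sorted unique labels; each label's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'basketball': 0, 'football': 1, 'soccer': 2}

# inverse_transform converts integers back to the original labels.
decoded = le.inverse_transform([2, 0, 1]).tolist()
print(decoded)  # ['soccer', 'basketball', 'football']

# A label that was not present during fit cannot be encoded.
try:
    le.transform(['tennis'])  # hypothetical unseen label
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)  # True
```

The last part is why the encoder should be fitted on data that covers every label you expect to meet later.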
Upvotes: 5
Reputation: 6667
As @PSK says, the fit() method of LabelEncoder stores the unique values of the array you pass to it. For example, if it is a numerical array, it will call numpy.unique():
import numpy as np
import pandas as pd
d = {'col1': [1, 2, 2, 3], 'col2': ['A', 'B', 'B', 'C']}
df = pd.DataFrame(data=d)
# For numerical array
np.unique(df.col1)
>>> array([1, 2, 3])
or, roughly speaking, build a (sorted) set if it is of object type:
set(df.col2)
>>> {'A', 'B', 'C'}
and stores this result in the .classes_ attribute of the LabelEncoder, which can later be accessed by other methods of the class, like transform(), to encode new data.
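To check this on the small dataframe above, a minimal sketch is to fit one encoder per column and look at .classes_ afterwards. The names num_le and str_le are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'col1': [1, 2, 2, 3], 'col2': ['A', 'B', 'B', 'C']})

# fit() returns the encoder itself, so the calls can be chained.
num_le = LabelEncoder().fit(df['col1'])
str_le = LabelEncoder().fit(df['col2'])

# The unique values found during fit are stored in classes_.
print(num_le.classes_.tolist())  # [1, 2, 3]
print(str_le.classes_.tolist())  # ['A', 'B', 'C']

# transform() uses classes_ to encode new data.
print(str_le.transform(['C', 'A']).tolist())  # [2, 0]
```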
Upvotes: 0