ds_user

Reputation: 2179

Label Encoder multiple levels

I am using scikit-learn's LabelEncoder in Python to transform my data. Here is a sample of the data.

                         Database      Target    Market_Description    Brand  \
0            CN_Milk powder_Incl_Others    NaN  Shanghai Hyper total  O.Brand   
1            CN_Milk powder_Incl_Others    NaN  Shanghai Hyper total  O.Brand   
2            CN_Milk powder_Incl_Others    NaN  Shanghai Hyper total  O.Brand   

  Sub_Brand Category                   Class_Category  
0       NaN      NaN  Hi Cal Adult Milk Powders- C1  
1       NaN      NaN  Hi Cal Adult Milk Powders- C1  
2       NaN      NaN  Hi Cal Adult Milk Powders- C1 

I am applying the transformation across all columns:

df3 = CountryDF.apply(preprocessing.LabelEncoder().fit_transform)   

When I check the unique values of the Target column, I get:

>>> print pd.unique(CountryDF.Target.ravel())
[nan 'Elder' 'Others' 'Lady']

But when I check the same column after the transformation, I get many more levels:

>>> print pd.unique(df3.Target.ravel())
[ 40749 667723 667725 ...,  43347  43346  43345]

I am not sure how this works. I expected four unique values, since I thought the transform works by finding the unique values and assigning each one its position in the sorted array of uniques. Can anyone help me understand this?

EDIT: This dataset is a subset of a larger dataset. Could that be related?

EDIT2: @Kevin, I tried what you suggested, and the result is weird. See this: [image]

Upvotes: 2

Views: 1528

Answers (1)

Kevin

Reputation: 8207

I don't think the size of the dataset is affecting your outcome. The purpose of LabelEncoder is to transform the prediction targets (in your case, I'm assuming, the Target column). From the User Guide:

LabelEncoder is a utility class to help normalize labels such that they contain only values between 0 and n_classes-1.
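
To make that concrete, here is a rough pure-Python sketch of the idea: collect the distinct labels, sort them, and replace each label with its position in the sorted list. This is only an illustration, not scikit-learn's actual implementation, and `encode_labels` is a made-up helper name.

```python
def encode_labels(values):
    """Map each distinct value to an integer in 0..n_classes-1 (sorted order)."""
    # The sorted distinct labels play the role of the encoder's "classes".
    classes = sorted(set(values))
    # Each label's code is its index in the sorted list of classes.
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[v] for v in values]


classes, codes = encode_labels(['Elder', 'Others', 'Lady', 'Elder'])
print(classes)  # ['Elder', 'Lady', 'Others']
print(codes)    # [0, 2, 1, 0]
```

With three distinct labels, every code falls between 0 and 2, which is the "0 to n_classes-1" range the User Guide describes.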

Here's an example. Note that I changed the values of Target in your sample CountryDF, just for demonstration purposes:

from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

CountryDF = pd.DataFrame([['CN_Milk powder_Incl_Others',np.nan,'Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'],
                              ['CN_Milk powder_Incl_Others','Elder','Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'],
                              ['CN_Milk powder_Incl_Others','Others','Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'],
                              ['CN_Milk powder_Incl_Others','Lady','Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'],
                             ['CN_Milk powder_Incl_Others',np.nan,'Shanghai Hyper total','O.Brand','S_B1',np.nan,'Hi Cal Adult Milk Powders- C1'],
                             ['CN_Milk powder_Incl_Others',np.nan,'Shanghai Hyper total','O.Brand','S_B2',np.nan,'Hi Cal Adult Milk Powders- C1']],
                            columns=['Database','Target','Market_Description','Brand','Sub_Brand', 'Category','Class_Category'])

First, initialize the LabelEncoder, then fit and transform the data (while assigning the transformed data to a new column).

le = LabelEncoder()  # initialize the LabelEncoder once

# Create a new column with the transformed values.
CountryDF['EncodedTarget'] = le.fit_transform(CountryDF['Target'])

Notice that the last column, EncodedTarget, is an encoded copy of Target.

CountryDF

                     Database  Target    Market_Description    Brand  Sub_Brand  Category                 Class_Category  EncodedTarget
0  CN_Milk powder_Incl_Others     NaN  Shanghai Hyper total  O.Brand        NaN       NaN  Hi Cal Adult Milk Powders- C1              0
1  CN_Milk powder_Incl_Others   Elder  Shanghai Hyper total  O.Brand        NaN       NaN  Hi Cal Adult Milk Powders- C1              1
2  CN_Milk powder_Incl_Others  Others  Shanghai Hyper total  O.Brand        NaN       NaN  Hi Cal Adult Milk Powders- C1              3
3  CN_Milk powder_Incl_Others    Lady  Shanghai Hyper total  O.Brand        NaN       NaN  Hi Cal Adult Milk Powders- C1              2

I hope this helps clear up how LabelEncoder works. If it doesn't fully answer your question, it might at least point you toward transforming your features (which may be what you're actually trying to do); for that, check out OneHotEncoder.

EDIT: I added two more rows to CountryDF (see above). They introduce two unique values in the Sub_Brand column, following a run of consecutive NaN values. I'm stumped as to why you're seeing this behavior; it works for me with pandas 0.17.0 and scikit-learn 0.17.

df3 = CountryDF.apply(LabelEncoder().fit_transform)
df3
   Database  Target  Market_Description  Brand  Sub_Brand  Category  Class_Category
0         0       0                   0      0          0         0               0
1         0       1                   0      0          0         1               0
2         0       3                   0      0          0         2               0
3         0       2                   0      0          0         3               0
4         0       0                   0      0          1         4               0
5         0       0                   0      0          2         5               0

I can't reproduce your problem. Do you have a link to your data?

pd.unique(CountryDF.Target.ravel())    
array([nan, 'Elder', 'Others', 'Lady'], dtype=object)
pd.unique(df3.Target.ravel())
array([0, 1, 3, 2])
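
One thing that might be worth checking: in the df3 output above, the all-NaN Category column came out as 0 through 5 rather than a single code. That is consistent with each NaN being treated as its own class, since NaN does not compare equal to itself. I can't confirm that this is what is happening with your data, but if Target contains many NaN values it could produce a lot of extra levels. A minimal sketch of one workaround, assuming it is acceptable to replace missing values with a placeholder before encoding (the string 'missing' and the small `df` frame are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the real data: one mixed column, one all-NaN column.
df = pd.DataFrame({
    'Target': [np.nan, 'Elder', 'Others', 'Lady', np.nan, np.nan],
    'Category': [np.nan] * 6,
})

# Replace NaN with an explicit placeholder so every column holds only strings.
# 'missing' is an arbitrary placeholder, not something scikit-learn requires.
filled = df.astype(object).fillna('missing')

# Fit a fresh encoder on each column.
encoded = filled.apply(lambda col: LabelEncoder().fit_transform(col))

print(encoded['Target'].nunique())    # 4
print(encoded['Category'].nunique())  # 1
```

After filling, all missing values in a column share one code, so Target ends up with four levels and Category with one.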

Upvotes: 1
