Reputation: 2179
I am using scikit-learn's LabelEncoder to transform my data. Here is my sample data:
Database Target Market_Description Brand \
0 CN_Milk powder_Incl_Others NaN Shanghai Hyper total O.Brand
1 CN_Milk powder_Incl_Others NaN Shanghai Hyper total O.Brand
2 CN_Milk powder_Incl_Others NaN Shanghai Hyper total O.Brand
Sub_Brand Category Class_Category
0 NaN NaN Hi Cal Adult Milk Powders- C1
1 NaN NaN Hi Cal Adult Milk Powders- C1
2 NaN NaN Hi Cal Adult Milk Powders- C1
I am applying the transformation across all columns:
df3 = CountryDF.apply(preprocessing.LabelEncoder().fit_transform)
When I check the unique values of the Target column, I get:
>>> print pd.unique(CountryDF.Target.ravel())
>>> [nan 'Elder' 'Others' 'Lady']
But when I check the same column after the transformation, I get many more levels:
>>> print pd.unique(df3.Target.ravel())
>>> [ 40749 667723 667725 ..., 43347 43346 43345]
I am not sure how this works. I expected four unique values, because I thought the transform collects the unique values, sorts them, and assigns each one an integer index. Can anyone help me understand this?
EDIT :- This dataset is a subset of a bigger dataset. Could that be related to this?
EDIT2 :- @Kevin I tried what you suggested, and it's weird. See this.
Upvotes: 2
Views: 1528
Reputation: 8207
I don't think the large dataset is affecting your outcome. The purpose of LabelEncoder is to transform the prediction targets (in your case, I'm assuming, the Target column). From the User Guide:
LabelEncoder is a utility class to help normalize labels such that they contain only values between 0 and n_classes-1.
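To make that quote concrete, here is a plain-Python sketch of the idea. It is only an illustration of mapping sorted distinct labels to 0..n_classes-1, not scikit-learn's actual implementation, and `encode_labels` is a hypothetical helper name:

```python
def encode_labels(values):
    """Map each distinct label to an integer in 0..n_classes-1.

    Hypothetical helper for illustration only: the distinct labels are
    sorted, and each label's code is its position in that sorted list.
    """
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes


codes, classes = encode_labels(['Elder', 'Others', 'Lady', 'Elder'])
print(classes)  # ['Elder', 'Lady', 'Others']
print(codes)    # [0, 2, 1, 0]
```

With three distinct labels, the codes stay within 0 to 2 no matter how many rows there are.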
Here's an example. Note that I changed the values of Target in your example CountryDF, just for demonstration purposes:
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
CountryDF = pd.DataFrame([['CN_Milk powder_Incl_Others',np.nan,'Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'],
['CN_Milk powder_Incl_Others','Elder','Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'],
['CN_Milk powder_Incl_Others','Others','Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'],
['CN_Milk powder_Incl_Others','Lady','Shanghai Hyper total','O.Brand',np.nan,np.nan,'Hi Cal Adult Milk Powders- C1'],
['CN_Milk powder_Incl_Others',np.nan,'Shanghai Hyper total','O.Brand','S_B1',np.nan,'Hi Cal Adult Milk Powders- C1'],
['CN_Milk powder_Incl_Others',np.nan,'Shanghai Hyper total','O.Brand','S_B2',np.nan,'Hi Cal Adult Milk Powders- C1']],
columns=['Database','Target','Market_Description','Brand','Sub_Brand', 'Category','Class_Category'])
First, initialize the LabelEncoder, then fit and transform the data (while assigning the transformed data to a new column).
le = LabelEncoder() # initialize the LabelEncoder once
# Create a new column with the transformed values.
CountryDF['EncodedTarget'] = le.fit_transform(CountryDF['Target'])
Notice that the last column, EncodedTarget, is a transformed copy of Target.
CountryDF
Database Target Market_Description Brand Sub_Brand Category Class_Category EncodedTarget
0 CN_Milk powder_Incl_Others NaN Shanghai Hyper total O.Brand NaN NaN Hi Cal Adult Milk Powders- C1 0
1 CN_Milk powder_Incl_Others Elder Shanghai Hyper total O.Brand NaN NaN Hi Cal Adult Milk Powders- C1 1
2 CN_Milk powder_Incl_Others Others Shanghai Hyper total O.Brand NaN NaN Hi Cal Adult Milk Powders- C1 3
3 CN_Milk powder_Incl_Others Lady Shanghai Hyper total O.Brand NaN NaN Hi Cal Adult Milk Powders- C1 2
I hope this helps clear up LabelEncoder. If it doesn't fully answer your question, it might at least point you toward transforming your features (which may be what you're trying to do?). For that, check out OneHotEncoder.
EDIT
I added two additional rows to CountryDF (see above). They add two unique values to the Sub_Brand column, which follow a series of consecutive NaN values. I'm stumped as to why you're seeing this behavior; it works for me with pandas 0.17.0 and scikit-learn 0.17.
df3 = CountryDF.apply(LabelEncoder().fit_transform)
df3
Database Target Market_Description Brand Sub_Brand Category Class_Category
0 0 0 0 0 0 0 0
1 0 1 0 0 0 1 0
2 0 3 0 0 0 2 0
3 0 2 0 0 0 3 0
4 0 0 0 0 1 4 0
5 0 0 0 0 2 5 0
I can't reproduce your problem. Do you have a link to your data?
pd.unique(CountryDF.Target.ravel())
array([nan, 'Elder', 'Others', 'Lady'], dtype=object)
pd.unique(df3.Target.ravel())
array([0, 1, 3, 2])
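One thing worth noting in the df3 output above: the Category column, which is entirely NaN, comes out as 0 through 5, so in this setup repeated NaN values did not collapse into a single label. Whether that is what happened in your data is only an assumption on my part, but NaN compares unequal to itself, which is one way many missing values can end up with separate codes. A minimal plain-Python sketch of replacing missing values with a placeholder before encoding (`fill_missing` and the `'Missing'` placeholder are hypothetical names, not part of any library):

```python
import math

nan = float('nan')
print(nan == nan)  # False: NaN is never equal to itself


def fill_missing(values, placeholder='Missing'):
    """Replace float NaN entries with a placeholder string (hypothetical helper)."""
    return [placeholder if isinstance(v, float) and math.isnan(v) else v
            for v in values]


# Three separate NaN objects plus three string labels.
target = [float('nan'), 'Elder', 'Others', 'Lady', float('nan'), float('nan')]
print(len(set(target)))  # 6: each NaN object counts as its own value

filled = fill_missing(target)
classes = sorted(set(filled))               # distinct labels in sorted order
codes = [classes.index(v) for v in filled]  # code = position in sorted labels
print(classes)  # ['Elder', 'Lady', 'Missing', 'Others']
print(codes)    # [2, 0, 3, 1, 2, 2]
```

On the DataFrame itself, the equivalent step would be something like CountryDF['Target'].fillna('Missing') before calling fit_transform, so that every missing entry shares one label.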
Upvotes: 1