gbhrea

Reputation: 494

OneHotEncoding Mapping

To encode categorical features I'm using a LabelEncoder and OneHotEncoder. I know that LabelEncoder maps the labels alphabetically, but how does OneHotEncoder map the data?

I have a pandas dataframe, dataFeat, with 5 different columns and 4 possible labels, like below. dataFeat = data[['Feat1', 'Feat2', 'Feat3', 'Feat4', 'Feat5']]

Feat1  Feat2  Feat3  Feat4  Feat5
  A      B      A      A      A
  B      B      C      C      C
  D      D      A      A      B
  C      C      A      A      A

I apply a LabelEncoder like this:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()

intIndexed = dataFeat.apply(le.fit_transform)

This is how the labels are encoded by the LabelEncoder

Label   LabelEncoded
 A         0
 B         1
 C         2
 D         3

I then apply a OneHotEncoder like this:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False)

encModel = enc.fit(intIndexed)

dataFeatY = encModel.transform(intIndexed)

intIndexed.shape is (94, 5) and dataFeatY.shape is (94, 20).

I am a bit confused by the shape of dataFeatY - shouldn't it also be (94, 5)?

Following MhFarahani's answer below, I have done this to see how the labels are mapped:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

S = np.array(['A', 'B','C','D'])
le = LabelEncoder()
S = le.fit_transform(S)
print(S)

[0 1 2 3]

ohe = OneHotEncoder()
one_hot = ohe.fit_transform(S.reshape(-1,1)).toarray()
print(one_hot.T)

[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]

Does this mean that labels are mapped like this, or is it different for each column? (That would explain the shape being (94, 20).)

Label   LabelEncoded    OneHotEncoded
 A         0               1.  0.  0.  0.
 B         1               0.  1.  0.  0.
 C         2               0.  0.  1.  0.
 D         3               0.  0.  0.  1.
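
To see where the 20 columns could come from, I also counted the unique labels per column of my full 94-row dataFeat (a quick sketch; uniqueCounts is just an illustrative name, and I'm assuming the one-hot width is the sum of these counts when each column is encoded separately):

# Count unique labels per column of the full dataFeat (94 rows, not the 4-row sample above)
uniqueCounts = dataFeat.apply(lambda col: col.nunique())
print(uniqueCounts)        # 4 for every column if each column contains all of A, B, C, D
print(uniqueCounts.sum())  # should match dataFeatY.shape[1], i.e. 20 here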

Upvotes: 3

Views: 7148

Answers (1)

MhFarahani

Reputation: 970

One-hot encoding means that you create vectors of ones and zeros, so the order does not matter. In sklearn, you first need to encode the categorical data to numerical data and then feed it to the OneHotEncoder, for example:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

S = np.array(['b','a','c'])
le = LabelEncoder()
S = le.fit_transform(S)
print(S)
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(S.reshape(-1,1)).toarray()
print(one_hot)

which results in:

[1 0 2]

[[ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]]

But pandas converts the categorical data directly:

import pandas as pd
S = pd.Series( {'A': ['b', 'a', 'c']})
print(S)
one_hot = pd.get_dummies(S['A'])
print(one_hot)

which outputs:

A    [b, a, c]
dtype: object

   a  b  c
0  0  1  0
1  1  0  0
2  0  0  1

As you can see, during the mapping a vector is created for each categorical feature. The elements of the vectors are one at the location of the categorical feature and zero everywhere else. Here is an example when there are only two categories in the series:

S = pd.Series( {'A': ['a', 'a', 'c']})
print(S)
one_hot = pd.get_dummies(S['A'])
print(one_hot)

results in:

A    [a, a, c]
dtype: object

   a  c
0  1  0
1  1  0
2  0  1

EDITS TO ANSWER THE NEW QUESTION

Let's start with this question: why do we perform one-hot encoding? If you encode categorical data like ['a','b','c'] to integers [1,2,3] (e.g. with LabelEncoder), in addition to encoding your categorical data you also give it an implicit ordering, 1 < 2 < 3. This way of encoding is fine for some machine learning techniques, like RandomForest. But many machine learning techniques would assume that 'a' < 'b' < 'c' if you encode them with 1, 2, 3 respectively. In order to avoid this issue, you can create a column for each unique category in your data. In other words, you create a new feature for each category (here one column for 'a', one for 'b' and one for 'c'). The values in these new columns are one where that category occurs and zero everywhere else.
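
As a minimal sketch of that point (the array and variable names here are just illustrative, not from your question):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cats = np.array(['a', 'b', 'c'])

# Integer codes carry an artificial order: 0 < 1 < 2 reads as 'a' < 'b' < 'c'
codes = LabelEncoder().fit_transform(cats)
print(codes)    # [0 1 2]

# One-hot columns are orthogonal, so no ordering between the categories is implied
one_hot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
print(one_hot)  # 3x3 identity matrix: one column per category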

For the array in your example, the one-hot encoding would be:

features ->  A   B   C   D 

          [[ 1.  0.  0.  0.]
           [ 0.  1.  0.  0.]
           [ 0.  0.  1.  0.]
           [ 0.  0.  0.  1.]]

You have 4 categories: "A", "B", "C", "D". Therefore, OneHotEncoder expands your (4,) array to (4, 4) to have one vector (or column) for each category (these become your new features). Since "A" is the 0th element of your array, index 0 of the first column is set to 1 and the rest are set to 0. Similarly, the second vector (column) belongs to feature "B"; since "B" was at index 1 of your array, index 1 of the "B" vector is set to 1 and the rest are set to zero. The same applies for the rest of the features.

Let me change your array; maybe it helps you better understand how the encoder works:

S = np.array(['D', 'B','C','A'])
S = le.fit_transform(S)
enc = OneHotEncoder()
encModel = enc.fit_transform(S.reshape(-1,1)).toarray()
print(encModel)

Now the result is the following. Here the first column corresponds to 'A', and since 'A' was the last element of your array (index 3), the last element of the first column is 1.

features ->  A   B   C   D
          [[ 0.  0.  0.  1.]
           [ 0.  1.  0.  0.]
           [ 0.  0.  1.  0.]
           [ 1.  0.  0.  0.]]

Regarding your pandas dataframe, dataFeat, your assumption about how LabelEncoder works is wrong even in the first step. When you apply LabelEncoder it fits to one column at a time and encodes it; then it moves to the next column and makes a new fit for that column. Here is what you should get:

from sklearn.preprocessing import LabelEncoder
df =  pd.DataFrame({'Feat1': ['A','B','D','C'],'Feat2':['B','B','D','C'],'Feat3':['A','C','A','A'],
                    'Feat4':['A','C','A','A'],'Feat5':['A','C','B','A']})
print('my data frame:')
print(df)

le = LabelEncoder()
intIndexed = df.apply(le.fit_transform)
print('Encoded data frame')
print(intIndexed)

results:

my data frame:
  Feat1 Feat2 Feat3 Feat4 Feat5
0     A     B     A     A     A
1     B     B     C     C     C
2     D     D     A     A     B
3     C     C     A     A     A

Encoded data frame
   Feat1  Feat2  Feat3  Feat4  Feat5
0      0      0      0      0      0
1      1      0      1      1      2
2      3      2      0      0      1
3      2      1      0      0      0

Note that in the first column, Feat1, 'A' is encoded to 0, but in the second column, Feat2, 'B' is encoded to 0. This happens because LabelEncoder fits each column and transforms it separately. In the second column, among ('B', 'C', 'D'), it is 'B' that comes first alphabetically.
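
If you want to see this explicitly, you can inspect the classes learned for each column (a small sketch on top of the df above; le_col is just an illustrative name):

# LabelEncoder sorts the labels it sees in each column and assigns 0, 1, 2, ... in that order
for col in df.columns:
    le_col = LabelEncoder().fit(df[col])
    print(col, list(le_col.classes_))
# Feat1 ['A', 'B', 'C', 'D']   -> 'A' maps to 0
# Feat2 ['B', 'C', 'D']        -> 'B' maps to 0, since 'A' never appears in Feat2
# ...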

And finally, here is what you are looking for with sklearn:

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
label_encoder = LabelEncoder()
data_label_encoded = df.apply(label_encoder.fit_transform).values
data_feature_onehot = encoder.fit_transform(data_label_encoded).toarray()
print(data_feature_onehot)

which gives you:

[[ 1.  0.  0.  0.  1.  0.  0.  1.  0.  1.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.  0.  1.  1.  0.  1.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.  1.  0.  1.  0.  1.  0.  1.  0.  0.]]

If you use pandas, you can compare the results, which hopefully gives you better intuition:

encoded = pd.get_dummies(df)
print(encoded)

result:

     Feat1_A  Feat1_B  Feat1_C  Feat1_D  Feat2_B  Feat2_C  Feat2_D  Feat3_A  \
0        1        0        0        0        1        0        0        1   
1        0        1        0        0        1        0        0        0   
2        0        0        0        1        0        0        1        1   
3        0        0        1        0        0        1        0        1   

     Feat3_C  Feat4_A  Feat4_C  Feat5_A  Feat5_B  Feat5_C  
0        0        1        0        1        0        0  
1        1        0        1        0        0        1  
2        0        1        0        0        1        0  
3        0        1        0        1        0        0  

which is exactly the same!
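
If you want to check that programmatically (assuming both snippets above were run on the same df), the two matrices should agree element-wise, because both list each feature's categories in alphabetical order:

import numpy as np

# get_dummies returns 0/1 integer columns; cast to float to compare with sklearn's output
print(np.array_equal(data_feature_onehot, encoded.values.astype(float)))  # should print True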

Upvotes: 9
