Siva Naidu

Reputation: 71

Sklearn preprocessing LabelEncoder throws an error with multiple columns

I have a pandas DataFrame with the following structure:

item_condition_id                     category
brand_name                            category
price                                  float64
shipping                              category
main_category                         category
category                              category
sub_category                          category
hashing_feature_aa                     float64
hashing_feature_ab                     float64

Example with a portion of the data:

brand_name  shipping  main_category        category
Target         1         Women           Tops & Blouses
unknown        1          Home           Home Décor
unknown        0         Women            Jewelry
unknown        0         Women             Other

I have converted the categorical (string) columns to numerical ones using the code below.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in range(len(X)):
    X.iloc[:,i] = le.fit_transform(X.iloc[:,i])

After conversion:

   brand_name  shipping  main_category  category
        0         1              1         3
        1         1              0         0
        1         0              1         1
        1         0              1         2

This works as expected, but when I try to apply inverse_transform to recover the original categories from the numerical codes, it throws an error.

for i in range(len(X)):
    X.iloc[:,i] = le.inverse_transform(X.iloc[:,i])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

How can I resolve this error? What is wrong with my code?

My goal is to convert the categorical (string) features to numerical ones using LabelEncoder so that I can apply sklearn.feature_selection.SelectKBest.fit_transform(X, y); without label encoding, this step fails.

Thanks

Upvotes: 0

Views: 4238

Answers (1)

Marcus V.

Reputation: 6859

Based on your clarification: the problem is that you reuse a single instance of le in your loop and refit it on every column, so after the loop it only knows the categories of the last column it was fitted on. Based on your code, I would suggest keeping one encoder per column in a dict, e.g. as follows:

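from sklearn.preprocessing import LabelEncoder
le = {}
# Fit a separate encoder for each column and keep it for later.
# Iterate over columns, not over range(len(X)), which counts rows.
for col in X.columns:
    le[col] = LabelEncoder()
    X[col] = le[col].fit_transform(X[col])
# do stuff
# Reverse each column with the encoder that was fitted on it
for col in X.columns:
    X[col] = le[col].inverse_transform(X[col])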
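To see the difference in isolation, here is a minimal, self-contained sketch on a toy frame built from the sample rows in the question. The names shared, encoders, encoded and decoded are only illustrative, and the frame is limited to three string columns for brevity. It first shows that a single reused encoder ends up holding only the last column's categories, and then does the round trip with one encoder per column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame mirroring the sample rows from the question (string columns only).
X = pd.DataFrame({
    "brand_name": ["Target", "unknown", "unknown", "unknown"],
    "main_category": ["Women", "Home", "Women", "Women"],
    "category": ["Tops & Blouses", "Home Décor", "Jewelry", "Other"],
})
original = X.copy()

# A single shared encoder is refit on every column,
# so afterwards it only remembers the last column's categories.
shared = LabelEncoder()
for col in X.columns:
    shared.fit_transform(X[col])
shared_classes = list(shared.classes_)

# One encoder per column keeps every mapping available.
encoders = {}
encoded = pd.DataFrame(index=X.index)
for col in X.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(X[col])

# Decode each column with the encoder that was fitted on it.
decoded = pd.DataFrame(index=X.index)
for col in encoded.columns:
    decoded[col] = encoders[col].inverse_transform(encoded[col])

print(shared_classes)
print(encoded)
print(decoded)
```

Writing the results into separate frames (encoded, decoded) rather than back into X is a choice made here so the original data stays untouched; overwriting X column by column as in the snippet above should work the same way.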

But as commented above, also look at this.

Upvotes: 1
