H. Saxena

Reputation: 345

Encode categorical features with multiple categories per example - sklearn

I'm working on a movie dataset which contains genre as a feature. The examples in the dataset may belong to multiple genres at the same time. So, they contain a list of genre labels.

The data looks like this:

    movieId                                         genres
0        1  [Adventure, Animation, Children, Comedy, Fantasy]
1        2                     [Adventure, Children, Fantasy]
2        3                                  [Comedy, Romance]
3        4                           [Comedy, Drama, Romance]
4        5                                           [Comedy]

I want to vectorize this feature. I have tried LabelEncoder and OneHotEncoder, but they can't seem to handle these lists directly.

I could vectorize this manually, but I have other similar features with far too many categories for that. For those, I'd prefer some way to use the FeatureHasher class directly.

Is there some way to get these encoder classes to work on such a feature? Or is there a better way to represent such a feature that will make encoding easier? I'd gladly welcome any suggestions.

Upvotes: 4

Views: 2582

Answers (1)

Peter Leimbigler

Reputation: 11105

This SO question has some impressive answers. On your example data, the last answer by Teoretic (using sklearn.preprocessing.MultiLabelBinarizer) is 14 times faster than the solution by Paulo Alves (and both are faster than the accepted answer!):

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# One binary indicator column per distinct genre; df is the DataFrame from the question
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
result = pd.concat([df['movieId'], encoded], axis=1)

# Increase max columns to print the entire resulting DataFrame
pd.options.display.max_columns = 50
result
   movieId  Adventure  Animation  Children  Comedy  Drama  Fantasy  Romance
0        1          1          1         1       1      0        1        0
1        2          1          0         1       0      0        1        0
2        3          0          0         0       1      0        0        1
3        4          0          0         0       1      1        0        1
4        5          0          0         0       1      0        0        0
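
For the other features with too many categories to binarize, the FeatureHasher class mentioned in the question can consume lists of strings directly when constructed with input_type='string'. The following is a minimal sketch under some assumptions: the sample df is rebuilt here so the snippet runs on its own, and n_features=16 is an arbitrary illustrative width that would need tuning for a real feature.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Sample data rebuilt from the question so the snippet is self-contained
df = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "genres": [
        ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
        ["Adventure", "Children", "Fantasy"],
        ["Comedy", "Romance"],
        ["Comedy", "Drama", "Romance"],
        ["Comedy"],
    ],
})

# n_features=16 is an assumed, illustrative value; larger values reduce collisions
hasher = FeatureHasher(n_features=16, input_type="string")

# FeatureHasher is stateless, so transform() works without a prior fit();
# each row is an iterable of strings, and the result is a SciPy sparse matrix
hashed = hasher.transform(df["genres"])

print(hashed.shape)
print(hashed.toarray())
```

Unlike MultiLabelBinarizer, the hashed columns carry no label names and cannot be mapped back to genres, and two labels may collide in the same column. With the default alternate_sign=True, entries can be -1 as well as +1.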

Upvotes: 7
