Formatting issues with MultiLabelBinarizer() after reading CSV into Pandas

Question

I'd like to use MultiLabelBinarizer() to prepare a column containing labels that apply to a text. For example, predicting which genres a movie might fall under based on the title.

MultiLabelBinarizer() works great when the values are pre-defined as a list in the DataFrame:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"Text": ["Blah blah", "Blah blah blah"],
                   "Tag": [["Hi", "Hello"], ["Hey"]]})

mlb = MultiLabelBinarizer()
print(mlb.fit_transform(df["Tag"]))
print(mlb.classes_)

Gives

array([[1, 0, 1],
       [0, 1, 0]])

array(['Hello', 'Hey', 'Hi'], dtype=object)

However, this approach fails when I'm reading a CSV or Excel file into Pandas. For example, if I make a simple CSV with the same structure, read it into Pandas, and use MultiLabelBinarizer():

df = pd.read_csv(filepath)

mlb = MultiLabelBinarizer()
print(mlb.fit_transform(df["Tag"]))
print(mlb.classes_)

It treats each character as a separate class and also doesn't output as array() anymore:

[[1 1 1 1 1 1 1 1 0]
 [0 1 0 1 1 0 0 0 1]]

[' ' '"' ',' 'H' 'e' 'i' 'l' 'o' 'y']

Given this limitation, how can I read from a CSV or Excel file and preserve the functionality of MultiLabelBinarizer()?

Michael Gardner · Accepted Answer

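read_csv() loads the Tag column as plain strings, not lists, so MultiLabelBinarizer() iterates over each string character by character. Add .str.split(",") to turn each cell into a list of labels before fitting: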

mlb.fit_transform(df["Tag"].str.split(","))
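Note that the classes in the question's output include a space and a double quote, which suggests the cells hold literal quotes and spaces around each label. If so, splitting alone would leave labels like ' "Hello"'. Here is a minimal sketch that also strips those characters. The CSV contents are an assumption reconstructed from the question's output, and io.StringIO stands in for the real file path:

```python
import io

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed cell contents (hypothetical): labels wrapped in literal double
# quotes and separated by ", ", matching the characters seen in the question.
raw = pd.DataFrame({"Text": ["Blah blah", "Blah blah blah"],
                    "Tag": ['"Hi", "Hello"', '"Hey"']})

# Round-trip through CSV text so the column comes back as plain strings,
# just as it would when reading a file from disk.
df = pd.read_csv(io.StringIO(raw.to_csv(index=False)))

# Split each cell on commas, then strip spaces and quotes from every label.
tags = df["Tag"].str.split(",").apply(
    lambda labels: [label.strip(' "') for label in labels]
)

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(tags)
print(encoded)
print(mlb.classes_)
```

If the cells contain bare labels such as Hi,Hello, the strip call is harmless and the plain .str.split(",") above is enough.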
