jxo

Reputation: 45

Dummy/One Hot Encoding of Comma Separated column in Dask

I have a column in a dask dataframe that contains comma-separated lists of categories. I'm looking to replicate the functionality of sklearn's MultiLabelBinarizer or the pandas method Series.str.get_dummies(','), exactly as this thread describes: Create dummies from column with multiple values in dask

Is there really no way to do this, as the only answer there states? Could it be implemented if I had a list of all the possible values?

Upvotes: 1

Views: 291

Answers (1)

SultanOrazbayev

Reputation: 16581

If the list of all classes is known, then it's an easy task for dask:

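import dask.dataframe as dd
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"col_a": ["c, d", "e", "g", "e, g", "d, e"]})
all_classes = ["c", "d", "e", "g"]
mlb = MultiLabelBinarizer(classes=all_classes)

def myfunc(df):
    # Split each "c, d" string into ["c", "d"]. Without the split, the
    # binarizer iterates over single characters, including "," and " ".
    labels = [[s.strip() for s in row.split(",")] for row in df["col_a"]]
    # Reuse the partition's index so the output rows line up with the input.
    return pd.DataFrame(mlb.fit_transform(labels), columns=all_classes, index=df.index)

ddf = dd.from_pandas(df, npartitions=2)

# meta describes the output: one integer column per class.
meta = pd.DataFrame(columns=all_classes, dtype=int)
ddf.map_partitions(myfunc, meta=meta).compute()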

If the list is not known, one option is to make a first pass over the dataframe to collect all unique values, then pass those classes to a snippet like the one above.

Upvotes: 2
