Reputation: 3823
I have quite a big dataframe in a shape like
animal ids
cat 1,3,4
dog 1,2,4
hamster 5
dolphin 3,5
It has about 60k rows; the ids column contains over 100k comma-separated integers in some rows, and most rows have over 10k ids. Trying to run
u = df["ids"].str.get_dummies(",")
so that I can calculate the Jaccard index, but due to the data size it crashes with a MemoryError, because Series.str.get_dummies() uses int64 as the dtype and there is no way to change it (at least I don't know how), as str.get_dummies() doesn't have a dtype parameter.
So I tried to run instead
u = pd.get_dummies(df, columns=["ids"], dtype=np.uint8)
which worked, but it produces a totally different result.
For example, if we run u = df["ids"].str.get_dummies(",")
on the example above, it produces
1 2 3 4 5
0 1 0 1 1 0
1 1 1 0 1 0
2 0 0 0 0 1
3 0 0 1 0 1
and if we run u = pd.get_dummies(df, columns=["ids"], dtype=np.uint8), it gives
animal ids_1,2,4 ids_1,3,4 ids_3,5 ids_5
0 cat 0 1 0 0
1 dog 1 0 0 0
2 hamster 0 0 0 1
3 dolphin 0 0 1 0
Is there a way to either set the dtype to uint8 for df["ids"].str.get_dummies(","), or to get a similar result using pd.get_dummies(df, columns=["ids"], dtype=np.uint8)?
Upvotes: 3
Views: 357
Reputation: 75100
For large data it might be a good idea to use MultiLabelBinarizer with sparse_output=True, which returns a sparse matrix; we can then use pd.DataFrame.sparse.from_spmatrix to convert it back to a dataframe:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# sparse_output=True keeps the one-hot matrix sparse, avoiding the MemoryError
mlb = MultiLabelBinarizer(sparse_output=True)
output = pd.DataFrame.sparse.from_spmatrix(mlb.fit_transform(df['ids'].str.split(',')),
                                           columns=mlb.classes_)
print(output)
print(output)
1 2 3 4 5
0 1 0 1 1 0
1 1 1 0 1 0
2 0 0 0 0 1
3 0 0 1 0 1
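Since the question's end goal is the Jaccard index, here is one possible follow-up: computing the pairwise Jaccard index between rows directly on the sparse indicator matrix, so nothing is densified until the final (n_rows × n_rows) result. This is a sketch, not part of the original answer; the toy `df` below just reproduces the example data from the question.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data matching the question's example
df = pd.DataFrame({"animal": ["cat", "dog", "hamster", "dolphin"],
                   "ids": ["1,3,4", "1,2,4", "5", "3,5"]})

mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(df["ids"].str.split(","))  # scipy sparse matrix of 0/1 indicators

# |A ∩ B| for every pair of rows via a sparse matrix product
inter = (X @ X.T).toarray()
# |A| for each row (set sizes)
sizes = np.asarray(X.sum(axis=1)).ravel()
# |A ∪ B| = |A| + |B| - |A ∩ B|
union = sizes[:, None] + sizes[None, :] - inter
jaccard = inter / union

print(pd.DataFrame(jaccard, index=df["animal"], columns=df["animal"]))
```

Only the final `jaccard` matrix is dense, so this stays feasible as long as n_rows² floats fit in memory; the 100k+ distinct ids never materialize as dense columns.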
Upvotes: 4