Reputation: 3823
I have quite a big dataframe in a shape like
animal ids
cat 1,3,4
dog 1,2,4
hamster 5
dolphin 3,5
It has about 60k rows; the ids column contains over 100k comma-separated integers in some rows, and most rows have over 10k ids. Trying to run
u = df["ids"].str.get_dummies(",")
so that I can calculate the Jaccard index, but due to the data size it crashes with a MemoryError, because Series.str.get_dummies() uses int64 as the dtype and there is no way to change it (at least I don't know how), as str.get_dummies() doesn't have a dtype parameter.
So I tried to run instead
u = pd.get_dummies(df, columns=["ids"], dtype=np.uint8)
which worked, but it produces a totally different result.
For example, if we run u = df["ids"].str.get_dummies(",")
on the example above, it produces
1 2 3 4 5
0 1 0 1 1 0
1 1 1 0 1 0
2 0 0 0 0 1
3 0 0 1 0 1
and if we run u = pd.get_dummies(df, columns=["ids"], dtype=np.uint8), it gives
animal ids_1,2,4 ids_1,3,4 ids_3,5 ids_5
0 cat 0 1 0 0
1 dog 1 0 0 0
2 hamster 0 0 0 1
3 dolphin 0 0 1 0
Is there a way to either set the dtype to uint8 for df["ids"].str.get_dummies(","), or to get a similar result using pd.get_dummies(df, columns=["ids"], dtype=np.uint8)?
Upvotes: 3
Views: 357
Reputation: 75100
For large data it might be a good idea to use MultiLabelBinarizer with sparse_output=True, which returns a sparse matrix; we can then use pd.DataFrame.sparse.from_spmatrix to convert it back to a dataframe:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# sparse_output=True keeps the one-hot matrix sparse, avoiding the MemoryError
mlb = MultiLabelBinarizer(sparse_output=True)
output = pd.DataFrame.sparse.from_spmatrix(mlb.fit_transform(df['ids'].str.split(',')),
                                           columns=mlb.classes_)
print(output)
print(output)
1 2 3 4 5
0 1 0 1 1 0
1 1 1 0 1 0
2 0 0 0 0 1
3 0 0 1 0 1
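Since the question's end goal is the Jaccard index, here is one possible follow-up: computing the pairwise Jaccard index between rows directly on the sparse indicator matrix, so nothing is densified until the final (n_rows × n_rows) result. This is a sketch, not part of the original answer; the toy `df` below just reproduces the example data from the question.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data matching the question's example
df = pd.DataFrame({"animal": ["cat", "dog", "hamster", "dolphin"],
                   "ids": ["1,3,4", "1,2,4", "5", "3,5"]})

mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(df["ids"].str.split(","))  # scipy sparse matrix of 0/1 indicators

# |A ∩ B| for every pair of rows via a sparse matrix product
inter = (X @ X.T).toarray()
# |A| for each row (set sizes)
sizes = np.asarray(X.sum(axis=1)).ravel()
# |A ∪ B| = |A| + |B| - |A ∩ B|
union = sizes[:, None] + sizes[None, :] - inter
jaccard = inter / union

print(pd.DataFrame(jaccard, index=df["animal"], columns=df["animal"]))
```

Only the final `jaccard` matrix is dense, so this stays feasible as long as n_rows² floats fit in memory; the 100k+ distinct ids never materialize as dense columns.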
Upvotes: 4