Rukesh Dutta

Reputation: 69

How to create a sparse-matrix-like DataFrame from a DataFrame in pandas/Python

I have a data frame like this: [input data]

I want to convert it to something like this. Note that ds is the day someone visited and will have values from 0 to 31. Days not visited should show 0, and days visited should show 1. It's kind of like a sparse matrix. Can someone help? [desired result]

Upvotes: 2

Views: 159

Answers (2)

sim

Reputation: 1257

Update: pd.get_dummies now accepts sparse=True to create a SparseArray output.

pd.get_dummies, called on a Series, creates a one-hot encoding:

import pandas as pd

header = ["ds", "buyer_id", "email_address"]
data = [[23, 305, "[email protected]"],
        [22, 307, "[email protected]"],
        [25, 411, "[email protected]"],
        [22, 588, "[email protected]"],
        [24, 664, "[email protected]"]]
df = pd.DataFrame(data, columns=header)
df.join(pd.get_dummies(df["ds"]))

output:

   ds  buyer_id  email_address      22  23  24  25
0  23       305  [email protected]   0   1   0   0
1  22       307  [email protected]   1   0   0   0
2  25       411  [email protected]   0   0   0   1
3  22       588  [email protected]   1   0   0   0
4  24       664  [email protected]   0   0   1   0

For added clarification: without sparse=True, the resulting DataFrame is still stored in a dense format. You could use the scipy.sparse matrix formats to store it in a truly sparse format.
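The question asks for a column for every day, including days nobody visited, whereas get_dummies only creates columns for values that actually occur. A minimal sketch of one way to fill in the missing days, assuming the day range is 0 to 31 as stated in the question and using a shortened version of the sample data (the names all_days, dummies and result are just illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"ds": [23, 22, 25, 22, 24], "buyer_id": [305, 307, 411, 588, 664]}
)

# One indicator column per observed day; dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(df["ds"], dtype=int)

# Assumed day range from the question: 0..31. Days that never occur
# become all-zero columns.
all_days = range(0, 32)
dummies = dummies.reindex(columns=all_days, fill_value=0)

# Prefix the column names so they do not collide with other columns.
result = df.join(dummies.add_prefix("ds_"))
print(result.loc[:, ["ds", "buyer_id", "ds_22", "ds_23", "ds_24", "ds_25"]])
```

Passing dtype=int is a choice made here so that the columns hold 0/1 regardless of pandas version; recent versions otherwise return boolean columns.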

Upvotes: 1

Prachi Chitnis

Reputation: 45

Adding to the solution from @sim: by using the columns parameter, one can avoid the join. With sparse=True the dummy columns are stored as sparse (SparseArray-backed) columns; with sparse=False they are ordinary dense columns.

import pandas as pd

header = ["ds", "buyer_id", "email_address"]
data = [[23, 305, "[email protected]"],
        [22, 307, "[email protected]"],
        [25, 411, "[email protected]"],
        [22, 588, "[email protected]"],
        [24, 664, "[email protected]"]]
df = pd.DataFrame(data, columns=header)
df = pd.get_dummies(df, columns=["ds"], sparse=True)

If you use sparse=True, the sparse columns can be converted back to dense with the .sparse.to_dense() accessor on those columns. For more details, refer to the sparse data structures section of the pandas User Guide.

# Select the dummy columns created from "ds" and convert only those to dense.
ds_cols = [col for col in df.columns if col.startswith("ds_")]
df = pd.concat([df[["buyer_id", "email_address"]],
                df[ds_cols].sparse.to_dense()], axis=1)
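To check that the dummy columns really are sparse before converting them back, one can inspect their dtypes and density. A small sketch, again with a shortened version of the sample data; the variable names encoded, is_sparse, density and dense are only illustrative, and dtype=int is an added choice so the values are 0/1:

```python
import pandas as pd

df = pd.DataFrame(
    {"ds": [23, 22, 25, 22, 24], "buyer_id": [305, 307, 411, 588, 664]}
)

# Encode "ds" in place of the original column, storing the dummies sparsely.
encoded = pd.get_dummies(df, columns=["ds"], sparse=True, dtype=int)
ds_cols = [col for col in encoded.columns if col.startswith("ds_")]

# Each dummy column should carry a SparseDtype.
is_sparse = [isinstance(encoded[col].dtype, pd.SparseDtype) for col in ds_cols]

# Fraction of stored (non-fill) values across the dummy columns.
density = encoded[ds_cols].sparse.density

# Dense copy of the same columns, e.g. for libraries that need plain arrays.
dense = encoded[ds_cols].sparse.to_dense()

print(encoded.dtypes)
print(density)
```

The .sparse accessor on a DataFrame only works when every selected column is sparse, which is why the dummy columns are selected first.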

Upvotes: 2
