Reputation: 69
I want to convert it to something like this. Note that ds is the day someone visited and takes values from 0 to 31. Days not visited should show 0, and days visited should show 1. It's kind of like a sparse matrix. Can someone help?
Upvotes: 2
Views: 159
Reputation: 1257
Update: pd.get_dummies now accepts sparse=True to create SparseArray output.
pd.get_dummies(s: pd.Series) can be used to create a one-hot encoding like this:
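import pandas as pd

header = ["ds", "buyer_id", "email_address"]
data = [[23, 305, "[email protected]"],
        [22, 307, "[email protected]"],
        [25, 411, "[email protected]"],
        [22, 588, "[email protected]"],
        [24, 664, "[email protected]"]]
df = pd.DataFrame(data, columns=header)

# One-hot encode the day column and attach the indicator columns to df.
# dtype=int keeps the 0/1 output on pandas >= 2.0, where the default is bool.
df.join(pd.get_dummies(df["ds"], dtype=int))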
output:
   ds  buyer_id     email_address  22  23  24  25
0  23       305  [email protected]   0   1   0   0
1  22       307  [email protected]   1   0   0   0
2  25       411  [email protected]   0   0   0   1
3  22       588  [email protected]   1   0   0   0
4  24       664  [email protected]   0   0   1   0
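The output above only has columns for days that actually occur in the data, while the question asks for a column per possible day. One way this might be handled is to reindex the dummy columns against the full range of days. This is only a sketch: the range 0 to 31 is taken from the question's wording, the names all_days, dummies and result are made up for illustration, and the email column is left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({"ds": [23, 22, 25, 22, 24],
                   "buyer_id": [305, 307, 411, 588, 664]})

# Assumed range of days, based on the question's "0 to 31".
all_days = list(range(0, 32))

# One-hot encode the observed days, then add the missing day columns
# filled with 0 so every day in all_days has its own column.
dummies = pd.get_dummies(df["ds"], dtype=int).reindex(columns=all_days, fill_value=0)
result = df.join(dummies)
```

Each row of result then has exactly one 1 among the day columns, in the column matching its ds value.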
Just for added clarification: the resulting DataFrame is still stored in a dense format. You could use scipy.sparse matrix formats to store it in a true sparse format.
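As a rough illustration of the scipy.sparse suggestion, the indicator columns could be handed to a CSR matrix, which stores only the non-zero entries. This is a minimal sketch under the assumption that only the ds column matters; the variable names dummies and mat are illustrative.

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame({"ds": [23, 22, 25, 22, 24]})
dummies = pd.get_dummies(df["ds"], dtype=int)

# CSR keeps only the non-zero values plus their index arrays.
mat = sparse.csr_matrix(dummies.to_numpy())

# The column labels are not stored in the matrix, so keep them separately.
day_labels = list(dummies.columns)
```

The matrix can be turned back into a dense array with mat.toarray() when needed.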
Upvotes: 1
Reputation: 45
Adding to the solution from @sim: by passing the columns parameter, one can avoid the join. With sparse=True the dummy columns are backed by a SparseArray (this is a pandas sparse column, not a scipy sparse matrix). With sparse=False they are ordinary dense columns.
import pandas as pd

header = ["ds", "buyer_id", "email_address"]
data = [[23, 305, "[email protected]"],
        [22, 307, "[email protected]"],
        [25, 411, "[email protected]"],
        [22, 588, "[email protected]"],
        [24, 664, "[email protected]"]]
df = pd.DataFrame(data, columns=header)

# Encode "ds" in place; the new columns are named ds_22, ds_23, ...
df = pd.get_dummies(df, columns=["ds"], sparse=True)
If you use sparse=True, the result can be converted back to dense by calling .sparse.to_dense() on the sparse columns. For more details, refer to the pandas User Guide section on sparse data structures.
ds_cols = [col for col in df.columns if col.startswith("ds_")]
df = pd.concat([df[["buyer_id", "email_address"]],
                df[ds_cols].sparse.to_dense()], axis=1)
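To check what sparse=True actually produced before converting anything back, one could inspect the column dtypes and the density through the .sparse accessor. The sketch below assumes a reasonably recent pandas with scipy installed (to_coo needs it), and the names encoded, density and coo are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ds": [23, 22, 25, 22, 24],
                   "buyer_id": [305, 307, 411, 588, 664]})
encoded = pd.get_dummies(df, columns=["ds"], sparse=True, dtype=int)

ds_cols = [c for c in encoded.columns if c.startswith("ds_")]

# The ds_ columns carry a Sparse dtype; buyer_id stays an ordinary column.
print(encoded.dtypes)

# Fraction of stored (non-fill) values across the sparse columns.
density = encoded[ds_cols].sparse.density

# Export the sparse columns to a scipy COO matrix without densifying first.
coo = encoded[ds_cols].sparse.to_coo()
```

The .sparse accessor on a DataFrame only works when every selected column is sparse, which is why the ds_ columns are selected first.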
Upvotes: 2