Reputation: 90
I have a data frame with four columns: pid (the playlist), track, cluster, and num_track. My goal is to create a new data frame in which each row keeps its pid and track and gains one column for each unique value in cluster, holding that cluster's count for the row's pid.
Here is a sample dataframe:
pid track cluster num_track
0 1 6 4
0 2 1 4
0 3 6 4
0 4 3 4
1 5 10 3
1 6 10 3
1 7 1 4
2 8 9 5
2 9 11 5
2 10 2 5
2 11 2 5
2 12 2 5
So my desired output would be:
pid track cluster num_track c1 c2 c3 c4 c5 c6 c7 ... c12
0 1 6 4 1 0 1 0 0 2 0 0
0 2 1 4 1 0 1 0 0 2 0 0
0 3 6 4 1 0 1 0 0 2 0 0
0 4 3 4 1 0 1 0 0 2 0 0
1 5 10 3 1 0 0 0 0 0 0 0
1 6 10 3 1 0 0 0 0 0 0 0
1 7 1 3 1 0 0 0 0 0 0 0
2 8 9 5 0 3 0 0 0 0 0 0
2 9 11 5 0 3 0 0 0 0 0 0
2 10 2 5 0 3 0 0 0 0 0 0
2 11 2 5 0 3 0 0 0 0 0 0
2 12 2 5 0 3 0 0 0 0 0 0
I hope I have presented my question correctly; if anything is incorrect, tell me! I don't have enough rep to set up a bounty yet, but I could repost when I do. Any help would be appreciated!
Upvotes: 3
Views: 186
Reputation: 323376
You can use crosstab with reindex, then concat the result back to the original df:
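import pandas as pd
# count clusters per pid, then repeat each pid's counts once per row of df
s = pd.crosstab(df.pid, df.cluster).reindex(df.pid)
s.index = df.index  # align the counts with df's row labels
df = pd.concat([df, s.add_prefix('c')], axis=1)  # axis must be a keyword in recent pandas
df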
Out[209]:
pid track cluster num_track c1 c2 c3 c6 c9 c10 c11
0 0 1 6 4 1 0 1 2 0 0 0
1 0 2 1 4 1 0 1 2 0 0 0
2 0 3 6 4 1 0 1 2 0 0 0
3 0 4 3 4 1 0 1 2 0 0 0
4 1 5 10 3 1 0 0 0 0 2 0
5 1 6 10 3 1 0 0 0 0 2 0
6 1 7 1 4 1 0 0 0 0 2 0
7 2 8 9 5 0 3 0 0 1 0 1
8 2 9 11 5 0 3 0 0 1 0 1
9 2 10 2 5 0 3 0 0 1 0 1
10 2 11 2 5 0 3 0 0 1 0 1
11 2 12 2 5 0 3 0 0 1 0 1
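Note that the output above only has columns for clusters that actually occur in the data. If you also want zero-filled columns for the missing ones (c4, c5, ... c12, as in the question's header), here is a minimal self-contained sketch. It assumes the cluster labels run from 1 to 12, which is only inferred from the question, and the names all_clusters, counts and result are just for illustration. It uses join on pid instead of reindex plus concat, which should give the same per-row counts.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "pid":       [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2],
    "track":     [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "cluster":   [6, 1, 6, 3, 10, 10, 1, 9, 11, 2, 2, 2],
    "num_track": [4, 4, 4, 4, 3, 3, 4, 5, 5, 5, 5, 5],
})

# Assumption: the full set of cluster labels is 1..12 (adjust to your data).
all_clusters = range(1, 13)

counts = (
    pd.crosstab(df["pid"], df["cluster"])           # one row per pid, one column per seen cluster
      .reindex(columns=all_clusters, fill_value=0)  # add never-seen clusters as all-zero columns
      .add_prefix("c")                              # 1 -> "c1", 2 -> "c2", ...
)

# Broadcast each pid's counts onto every row that has that pid.
result = df.join(counts, on="pid")
print(result)
```

Because join matches on the pid column, df keeps its original row index and there is no need to overwrite the index by hand.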
Upvotes: 5