Craig
Craig

Reputation: 1985

Pandas: encoding row values/transform dataframe

This is a little similar to this answer but my problem is a little different.

My dataframe:

df=pd.DataFrame([{'date': '2017-01-01', 'id': 'google', 'category': 1, 'Score1': 51, 'Score2': 531},
         {'date': '2017-01-01', 'id': 'google', 'category': 2, 'Score1': 592, 'Score2': 152},
         {'date': '2017-01-01', 'id': 'google', 'category': 5, 'Score1': 55, 'Score2': 255},
         {'date': '2017-01-01', 'id': 'yahoo', 'category': 7, 'Score1': 597, 'Score2': 357},
         {'date': '2017-01-01', 'id': 'yahoo', 'category': 8, 'Score1': 58, 'Score2': 58},
         {'date': '2017-01-02', 'id': 'google', 'category': 5, 'Score1': 795, 'Score2': 455},
         {'date': '2017-01-02', 'id': 'google', 'category': 1, 'Score1': 71, 'Score2': 751},
         {'date': '2017-01-02', 'id': 'google', 'category': 2, 'Score1': 792, 'Score2': 352},
         {'date': '2017-01-02', 'id': 'yahoo', 'category': 7, 'Score1': 77, 'Score2': 957},
         {'date': '2017-01-02', 'id': 'yahoo', 'category': 8, 'Score1': 798, 'Score2': 358}
       ])

And looks like this:

         date      id  category  Score1  Score2
0  2017-01-01  google         1      51     531
1  2017-01-01  google         2     592     152
2  2017-01-01  google         5      55     255
3  2017-01-01   yahoo         7     597     357
4  2017-01-01   yahoo         8      58      58
5  2017-01-02  google         5     795     455
6  2017-01-02  google         1      71     751
7  2017-01-02  google         2     792     352
8  2017-01-02   yahoo         7      77     957
9  2017-01-02   yahoo         8     798     358

I need to transform this into a dataframe that looks like this:

date    id  cat1_score1 cat2_score1 cat5_score1 cat7_score1 cat8_score1 cat1_score2 cat2_score2 cat5_score2 cat7_score2 cat8_score2
1/1/17  google    51        592         55           0            0            531       152        255           0         0
1/1/17  yahoo      0        0            0          597          58             0         0          0           357        58
1/2/17  google    71        792        795           0            0            751       352        455           0         0
1/2/17  yahoo      0         0           0          77           798            0         0          0           957        358

The caveat here is that the number of categories can vary from id to id. There could also potentially be a secondary ID column which would need to be considered. I could enumerate over the values in the id column but then how would I transform the dataframe accordingly?

Upvotes: 1

Views: 159

Answers (1)

Scott Boston
Scott Boston

Reputation: 153460

Let's reshape like this:

df_out = df.set_index(['date', 'id', 'category']).stack().unstack([2, 3]).fillna(0)
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out = df_out.add_prefix('cat').reset_index()
print(df_out)

Output:

         date      id  cat1_Score1  cat1_Score2  cat2_Score1  cat2_Score2  \
0  2017-01-01  google         51.0        531.0        592.0        152.0   
1  2017-01-01   yahoo          0.0          0.0          0.0          0.0   
2  2017-01-02  google         71.0        751.0        792.0        352.0   
3  2017-01-02   yahoo          0.0          0.0          0.0          0.0   

   cat5_Score1  cat5_Score2  cat7_Score1  cat7_Score2  cat8_Score1  \
0         55.0        255.0          0.0          0.0          0.0   
1          0.0          0.0        597.0        357.0         58.0   
2        795.0        455.0          0.0          0.0          0.0   
3          0.0          0.0         77.0        957.0        798.0   

   cat8_Score2  
0          0.0  
1         58.0  
2          0.0  
3        358.0 

Explanation: First set_index on everything but scoring columns, then use stack to align all the scores vertically. Next, use unstack to pivot category and score labels to columns and fillna with zeroes. Use columns map to combine column levels into one level and add_prefix 'cat' to columns. Lastly, reset_index get all columns back into the dataframe from index.

Upvotes: 3

Related Questions