Reputation: 1985
This is a little similar to this answer but my problem is a little different.
My dataframe:
df=pd.DataFrame([{'date': '2017-01-01', 'id': 'google', 'category': 1, 'Score1': 51, 'Score2': 531},
{'date': '2017-01-01', 'id': 'google', 'category': 2, 'Score1': 592, 'Score2': 152},
{'date': '2017-01-01', 'id': 'google', 'category': 5, 'Score1': 55, 'Score2': 255},
{'date': '2017-01-01', 'id': 'yahoo', 'category': 7, 'Score1': 597, 'Score2': 357},
{'date': '2017-01-01', 'id': 'yahoo', 'category': 8, 'Score1': 58, 'Score2': 58},
{'date': '2017-01-02', 'id': 'google', 'category': 5, 'Score1': 795, 'Score2': 455},
{'date': '2017-01-02', 'id': 'google', 'category': 1, 'Score1': 71, 'Score2': 751},
{'date': '2017-01-02', 'id': 'google', 'category': 2, 'Score1': 792, 'Score2': 352},
{'date': '2017-01-02', 'id': 'yahoo', 'category': 7, 'Score1': 77, 'Score2': 957},
{'date': '2017-01-02', 'id': 'yahoo', 'category': 8, 'Score1': 798, 'Score2': 358}
])
And looks like this:
date id category Score1 Score2
0 2017-01-01 google 1 51 531
1 2017-01-01 google 2 592 152
2 2017-01-01 google 5 55 255
3 2017-01-01 yahoo 7 597 357
4 2017-01-01 yahoo 8 58 58
5 2017-01-02 google 5 795 455
6 2017-01-02 google 1 71 751
7 2017-01-02 google 2 792 352
8 2017-01-02 yahoo 7 77 957
9 2017-01-02 yahoo 8 798 358
I need to transform this into a dataframe that looks like this:
date id cat1_score1 cat2_score1 cat5_score1 cat7_score1 cat8_score1 cat1_score2 cat2_score2 cat5_score2 cat7_score2 cat8_score2
1/1/17 google 51 592 55 0 0 531 152 255 0 0
1/1/17 yahoo 0 0 0 597 58 0 0 0 357 58
1/2/17 google 71 792 795 0 0 751 352 455 0 0
1/2/17 yahoo 0 0 0 77 798 0 0 0 957 358
The caveat here is that the number of categories can vary from id
to id
. There could also potentially be a secondary ID column which would need to be considered.
I could enumerate
over the values in the id
column but then how would I transform the dataframe accordingly?
Upvotes: 1
Views: 159
Reputation: 153460
Let's reshape like this:
df_out = df.set_index(['date', 'id', 'category']).stack().unstack([2, 3]).fillna(0)
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out = df_out.add_prefix('cat').reset_index()
print(df_out)
Output:
date id cat1_Score1 cat1_Score2 cat2_Score1 cat2_Score2 \
0 2017-01-01 google 51.0 531.0 592.0 152.0
1 2017-01-01 yahoo 0.0 0.0 0.0 0.0
2 2017-01-02 google 71.0 751.0 792.0 352.0
3 2017-01-02 yahoo 0.0 0.0 0.0 0.0
cat5_Score1 cat5_Score2 cat7_Score1 cat7_Score2 cat8_Score1 \
0 55.0 255.0 0.0 0.0 0.0
1 0.0 0.0 597.0 357.0 58.0
2 795.0 455.0 0.0 0.0 0.0
3 0.0 0.0 77.0 957.0 798.0
cat8_Score2
0 0.0
1 58.0
2 0.0
3 358.0
Explanation: First set_index
on everything but scoring columns, then use stack
to align all the scores vertically. Next, use unstack
to pivot category and score labels to columns and fillna
with zeroes. Use columns map
to combine column levels into one level and add_prefix
'cat' to columns. Lastly, reset_index
get all columns back into the dataframe from index.
Upvotes: 3