Reputation: 158
I have a dataset consisting sports teams performance across the seasons, from which I have created a melted dataset separating the home and away teams.
Original dataset:
Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee
0 E0 11/08/17 Arsenal Leicester 4 3 H 2 2 D M Dean
1 E0 12/08/17 Brighton Man City 0 2 A 0 0 D M Oliver
2 E0 12/08/17 Chelsea Burnley 2 3 A 0 3 A C Pawson
Melted dataset:
Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG \
0 E0 2017-08-11 Arsenal Leicester 4 3 H 2 2
1 E0 2017-08-11 Arsenal Leicester 4 3 H 2 2
2 E0 2017-08-12 Watford Liverpool 3 3 D 2 1
3 E0 2017-08-12 West Brom Bournemouth 1 0 H 1 0
4 E0 2017-08-12 Crystal Palace Huddersfield 0 3 A 0 2
HTR Referee Home/Away Team Opponent
0 D M Dean 1 Arsenal Leicester
1 D M Dean 0 Leicester Arsenal
2 H A Taylor 0 Liverpool Watford
3 H R Madley 0 Bournemouth West Brom
4 A J Moss 0 Huddersfield Crystal Palace
I have also added extra columns to calculate cumulative goals/ whatnot. So realistically, it looks like
Date Home/Away Team Opponent Cumg Cumc Result \
0 2017-08-11 1 Arsenal Leicester 0.0 0.0 1
1 2017-08-11 0 Leicester Arsenal 0.0 0.0 1
2 2017-08-12 0 Liverpool Watford 0.0 0.0 2
3 2017-08-12 0 Bournemouth West Brom 0.0 0.0 1
4 2017-08-12 0 Huddersfield Crystal Palace 0.0 0.0 0
Cumw Cuml Cumd Cumtr win_streak lose_streak
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
I would like to "unmelt" the dataset by returning to the original format, but keeping the new columns i have added. Like this
Date Home/Away Team Opponent Cumg_team Cumc_team Result \
0 2017-08-11 1 Arsenal Leicester 0.0 0.0 1
1 2017-08-19 0 Arsenal Stoke 3.0 4.0 1
2 2017-08-27 0 Arsenal Liverpool 3.0 5.0 1
3 2017-09-09 1 Arsenal Bournemouth 3.0 9.0 1
4 2017-09-17 0 Arsenal Chelsea 3.0 12.0 2
Cumw_team Cuml_team Cumd Cumtr_team win_streak_team lose_streak_team \
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0 0.0 0.0
2 0.0 2.0 0.0 0.0 1.0 0.0
3 0.0 3.0 0.0 0.0 0.0 0.0
4 0.0 4.0 0.0 0.0 0.0 0.0
Cumw_opponent Cuml_opponent Cumg_opponent Cumc_opponent Cumtr_opponent \
0 0.0 0.0 0.0 0.0 0.0
1 0.0 1.0 0.0 1.0 0.0
2 0.0 1.0 3.0 4.0 1.0
3 2.0 1.0 4.0 2.0 6.0
4 3.0 1.0 7.0 6.0 9.0
win_streak_opponent lose_streak_opponent
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
I could do this by
df1 = df[df['Team']=='Arsenal'].set_index('Date')
df2 = df[df['Opponent']=='Arsenal].set_index('Date')
df3 = df1.join(df2).reset_index()
But that's not very efficient, i wonder if there is anyway i can do this with purej SQL-like manipulation of the pandas df?
Upvotes: 0
Views: 336
Reputation: 862601
It seems you need:
df1 = df.set_index(['Date', 'Team'])
df2 = df.set_index(['Date', 'Opponent'])
df3 = pd.concat([df1, df2], axis=1)
Upvotes: 1