How to merge rows in a dataframe based on a column value?

Question

I have a data-set that is in the shape of this, where each row represents a in a specific match that is specified by the gameID.

  gameID          Won/Lost   Home   Away  metric2 metric3 metric4   team1 team2 team3 team4
2017020001         1          1      0      10      10      10      1     0     0      0
2017020001         0          0      1      10      10      10      0     1     0      0

The thing I want to do is create a function that takes the rows with the same gameID and joins them. As you can see in data example below, the two rows represents one game that is split up into a home team (row_1 ) and an away team (row_2). I want these two rows to sit on one row only.

Won/Lost  h_metric2 h_metric3 h_metric4 a_metric2 a_metric3 a_metric4 h_team1 h_team2 h_team3 h_team4 a_team1 a_team2 a_team3 a_team4
1            10       10         10        10         10        10      1       0        0      0         0      1        0      0

How do I get this result?

EDIT: I created too much confusion, posting my code so you can get a better grasp of the problem I want to solve.

import numpy as np
import pandas as pd
import requests
import json
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

results = []
for game_id in range(2017020001, 2017020010, 1):
    url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
r = requests.get(url)
game_data = r.json()

for homeaway in ['home','away']:

    game_dict = game_data.get('teams').get(homeaway).get('teamStats').get('teamSkaterStats')
    game_dict['team'] = game_data.get('teams').get(homeaway).get('team').get('name')
    game_dict['homeaway'] = homeaway
    game_dict['game_id'] = game_id
    results.append(game_dict)

df = pd.DataFrame(results)

df['Won/Lost'] = df.groupby('game_id')['goals'].apply(lambda g: (g == g.max()).map({True: 1, False: 0}))

df["faceOffWinPercentage"] = df["faceOffWinPercentage"].astype('float')
df["powerPlayPercentage"] = df["powerPlayPercentage"].astype('float')
df["team"] = df["team"].astype('category')
df = pd.get_dummies(df, columns=['homeaway'])
df = pd.get_dummies(df, columns=['team'])

timgeb · Accepted Answer

This is under the assumption that you have exactly two rows per gameID and that you want to group by that ID. (It also assumes that I understand the question.)

Improved solution

Given a dataframe df such as

       gameID  Won/Lost  Home  Away  metric2  metric3  metric4  team1  team2  team3  team4
0  2017020001         1     1     0       10       10       10      1      0      0      0
1  2017020001         0     0     1       10       10       10      0      1      0      0
2  2017020002         1     1     0       10       10       10      1      0      0      0
3  2017020002         0     0     1       10       10       10      0      1      0      0

you can use pd.merge (and some data munging) like this:

>>> is_home = df['Home'] == 1                                                                                                                                                                                                                   
>>> home = df[is_home].drop(['Home', 'Away'], axis=1).add_prefix('h_').rename(columns={'h_gameID':'gameID'})                                                                                                                                    
>>> away = df[~is_home].drop(['Won/Lost', 'Home', 'Away'], axis=1).add_prefix('a_').rename(columns={'a_gameID':'gameID'})                                                                                                                       
>>> pd.merge(home, away, on='gameID')                                                                                                                                                                                                           
       gameID  h_Won/Lost  h_metric2  h_metric3  h_metric4  h_team1  h_team2  h_team3  h_team4  a_metric2  a_metric3  a_metric4  a_team1  a_team2  a_team3  a_team4
0  2017020001           1         10         10         10        1        0        0        0         10         10         10        0        1        0        0
1  2017020002           1         10         10         10        1        0        0        0         10         10         10        0        1        0        0

(I kept the prefix for Won/Lost because it indicates that it's the statistic for the home team. Also, if anybody knows how to add the prefixes more elegantly without having to re-rename the gameID please leave a comment.)

Original Attempt

You can apply the following function after grouping

def munge(group): 
     is_home = group.Home == 1 
     wonlost = group.loc[is_home, 'Won/Lost'].reset_index(drop=True) 
     group = group.loc[:, 'metric2':] 
     home = group[is_home].add_prefix('h_').reset_index(drop=True) 
     away = group[~is_home].add_prefix('a_').reset_index(drop=True) 
     return pd.concat([wonlost, home, away], axis=1)

... like this:

>>> df.groupby('gameID').apply(munge).reset_index(level=1, drop=True)                                                                                                                                                                           
            Won/Lost  h_metric2  h_metric3  h_metric4  h_team1  h_team2  h_team3  h_team4  a_metric2  a_metric3  a_metric4  a_team1  a_team2  a_team3  a_team4
gameID                                                                                                                                                        
2017020001         1         10         10         10        1        0        0        0         10         10         10        0        1        0        0
2017020002         1         10         10         10        1        0        0        0         10         10         10        0        1        0        0

How to merge rows in a dataframe based on a column value?

Answers (2)

Related Questions