Reputation: 3494
I'm trying to better understand pandas' group operations.
As an example, let's say I have a dataframe which has a list of sets played in tennis matches.
tennis_sets = pd.DataFrame.from_items([
('date', ['27/05/13', '27/05/13', '28/05/13', '28/05/13',
'28/05/13', '29/05/13', '29/05/13']),
('player_A', [6, 6, 2, 6, 7, 6, 6]),
('player_B', [4, 3, 6, 7, 6, 1, 0])
])
Resulting in
date player_A player_B
0 27/05/13 6 4
1 27/05/13 6 3
2 28/05/13 2 6
3 28/05/13 6 7
4 28/05/13 7 6
5 29/05/13 6 1
6 29/05/13 6 0
I'd like to determine the overall score for each match played on a given day. This should look like
date player_A player_B
0 27/05/13 2 0
1 28/05/13 1 2
2 29/05/13 2 0
So, I could do this by creating a new numpy array and iterating as follows:
matches = tennis_sets.groupby('date')
scores = np.zeros((len(matches),2))
for i, (_, match) in enumerate(matches):
a, b = match.player_A, match.player_B
scores[i] = np.c_[sum(a>b), sum(b>a)]
I could then reattach this new scores array to the dates. However, it seems unlikely that this is the preferred way of doing things.
To create a new dataframe with each date and match score as above, is there a better way I can achieve this using pandas' api?
Upvotes: 0
Views: 157
Reputation: 17550
To answer your question, yes there are ways to do this in pandas. There may be a more elegant solution, but here's a quick one which uses pandas groupby to perform a sum over the dataframe grouped by date:
In [13]: tennis_sets
Out[13]:
date player_A player_B
0 27/05/13 6 4
1 27/05/13 6 3
2 28/05/13 2 6
3 28/05/13 6 7
4 28/05/13 7 6
5 29/05/13 6 1
6 29/05/13 6 0
In [14]: tennis_sets["pA_wins"] = tennis_sets["player_A"] > tennis_sets["player_B"]
In [15]: tennis_sets["pB_wins"] = tennis_sets["player_B"] > tennis_sets["player_A"]
In [18]: tennis_sets
Out[18]:
date player_A player_B pA_wins pB_wins
0 27/05/13 6 4 True False
1 27/05/13 6 3 True False
2 28/05/13 2 6 False True
3 28/05/13 6 7 False True
4 28/05/13 7 6 True False
5 29/05/13 6 1 True False
6 29/05/13 6 0 True False
In [21]: matches = tennis_sets.groupby("date").sum()
In [22]: matches[["pA_wins", "pB_wins"]]
Out[22]:
pA_wins pB_wins
date
27/05/13 2 0
28/05/13 1 2
29/05/13 2 0
Upvotes: 1