Reputation: 121
I have a dataframe with the following columns:
| winner | loser | tournament |
+--------+---------+------------+
| John | Steve | A |
+--------+---------+------------+
| Steve | John | B |
+--------+---------+------------+
| John | Michael | A |
+--------+---------+------------+
| Steve | John | A |
+--------+---------+------------+
What I want to do is to calculate the historical win % of the winner and the loser, for the given tournament type, and put this in its own column.
An example filling out the above table is below. Games will be referred to as (winner, loser, type).
I have also added the intermediate calculation columns too to make it clearer.
1) For the first game (John, Steve, A). There are no previous games of type A. So we fill with 0s.
2) For the second game (Steve, John, B). There are no previous games of type B. So we fill with 0s.
3) For the third game (John, Michael, A). There are previous games of Type A, so we can get the info. First, John is the winner. He has won 1 game of type A in the first row of the table. So we put winner wins = 1. John has not lost a game of type A before, so we put winner losses = 0. Michael does not have any game history, so we fill loser wins = 0 and loser losses = 0.
4) For the 4th game, (Steve, John, A). We see Steve has not previously won any games of type A, so we put winner wins = 0. He has lost 1 game of type A (first row). So we put winner losses = 1. John has won 2 games of type A, so loser wins = 2. He has lost
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| winner | loser | tournament | winner wins | loser wins | winner losses | loser losses | winner win % | loser win % |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| John | Steve | A | 0 | 0 | 0 | 0 | 0/(0+0) | 0/(0+0) |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| Steve | John | B | 0 | 0 | 0 | 0 | 0/(0+0) | 0/(0+0) |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| John | Michael | A | 1 | 0 | 0 | 0 | 1/(1+0) | 0/(0+0) |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| Steve | John | A | 0 | 2 | 1 | 0 | 0/(0+1) | 2/(2+0) |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------
Upvotes: 2
Views: 1159
Reputation: 7361
This shoud produce the expected result:
def win_los_percent(sdf):
sdf['winner wins'] = sdf.groupby('winner').cumcount()
sdf['winner losses'] = [(sdf.loc[0:i, 'loser'] == sdf.loc[i, 'winner']).sum() for i in sdf.index]
sdf['loser losses'] = sdf.groupby('loser').cumcount()
sdf['loser wins'] = [(sdf.loc[0:i, 'winner'] == sdf.loc[i, 'loser']).sum() for i in sdf.index]
sdf['winner win %'] = sdf['winner wins'] / (sdf['winner wins'] + sdf['winner losses'])
sdf['loser win %'] = sdf['loser wins'] / (sdf['loser wins'] + sdf['loser losses'])
return sdf
ddf = df.groupby('tournament').apply(win_los_percent)
Using the data provided, ddf
is:
winner loser tournament winner wins winner losses loser losses loser wins winner win % loser win %
0 John Steve A 0 0 0 0 NaN NaN
1 Steve John B 0 0 0 0 NaN NaN
2 John Michael A 1 0 0 0 1.0 NaN
3 Steve John A 0 1 0 2 0.0 1.0
pandas groupby is used to group the data of the same tournament, and pass the subdataframe to the win_los_percent
function. The return value of this function is returned to build the final dataframe.
For each subset, the function calculates the several columns:
sdf['winner wins']
and sdf['loser losses']
are obtained by using cumcount: for each row, this method counts the previous occurrences of the value (the player name) in the grouping column.sdf['winner losses']
and sdf['loser wins']
is a bit more elaborated, since we need to count the previous occurrence of a value (player name) in another column.'winner'
is equal to the playeer name at current row in column loser
or viceversa. sum()
allows to count the True values: True is casted to 1, False to 0, the sum gives the wanted result: how many times the player name is present in previous rows.NaN
is because there is a division by 0.Upvotes: 1
Reputation: 5451
Here is my attempt
Explanation
d = {
"winner": "John Steve John Steve".split(),
"loser": "Steve John Michael John".split(),
"tournament": "A B A A".split()
}
def calc_wining_percent (group):
group["wining_percent"] = group["won"].cumsum().shift()/np.arange(1, len(group)+1)
return group
df = pd.DataFrame(d)
new_df = pd.DataFrame(np.ravel(df[["winner", "loser"]]), columns=["player"])
new_df["won"]= np.r_[np.ones(len(df)), np.zeros(len(df))]
new_df["tournament"] = np.tile(df["tournament"],2)
new_df = new_df.groupby(["player", "tournament"]).apply(calc_wining_percent)
df["winner win %"] = new_df["wining_percent"].values.reshape(-1,2)[:,0]
df["loser win %"] = new_df["wining_percent"].values.reshape(-1,2)[:,1]
display("result", df)
Upvotes: 0
Reputation: 119
The first solution that comes to mind to me is to use object-oriented programming. I am proposing some guidelines of an implementation below.
You can create a player
class with attributes firstName
, lastName
, numberWins
, and numberLosses
. In the constructor, you can set numberWins
and numberLosses
to 0. I also recommend adding a wins
method and loses
method, which will, respectively, increment numberWins
and numberLosses
by 1.
After each match, you can increment the attributes in accordance with the outcome of the match (e.g. if I lose, my number of losses increases by 1, if I win, my number of wins increases by 1). At any time, you can get the historical win % with player.numberWins / (player.numberWins + player.numberLosses)
.
If you want to compute this ratio by tournament, you can then define the numberWins
and numberLosses
attributes to be data frames. where each column represents a tournament, (e.g. column labels are A, B, C...), and each row is a corresponding number of wins. Then, you would also add an argument in the wins
and loses
methods to specify the tournament at hand. (You could even define a score
class for more flexibility)
In your big data frame where you record matches, you can then actually store two player objects (instead of Strings identifying the players, as you do right now) and change their attributes as you go.
Then, there might also be some quick Math sorcery. However, the object-oriented implementation seems particularly appropriate in your case, at first sight (a player is, in the end, an object, with its own attributes and methods...).
Upvotes: 0