Convex Leopard
Convex Leopard

Reputation: 121

Pandas - win ratio calculation; group by two columns and count

I have a dataframe with the following columns:

| winner |  loser  | tournament |
+--------+---------+------------+
| John   | Steve   |      A     |
+--------+---------+------------+
| Steve  | John    |      B     |
+--------+---------+------------+
| John   | Michael |      A     |
+--------+---------+------------+
| Steve  | John    |      A     |
+--------+---------+------------+

What I want to do is to calculate the historical win % of the winner and the loser, for the given tournament type, and put this in its own column.


An example filling out the above table is below. Games will be referred to as (winner, loser, type).

I have also added the intermediate calculation columns too to make it clearer.


1) For the first game (John, Steve, A). There are no previous games of type A. So we fill with 0s.

2) For the second game (Steve, John, B). There are no previous games of type B. So we fill with 0s.

3) For the third game (John, Michael, A). There are previous games of Type A, so we can get the info. First, John is the winner. He has won 1 game of type A in the first row of the table. So we put winner wins = 1. John has not lost a game of type A before, so we put winner losses = 0. Michael does not have any game history, so we fill loser wins = 0 and loser losses = 0.

4) For the 4th game, (Steve, John, A). We see Steve has not previously won any games of type A, so we put winner wins = 0. He has lost 1 game of type A (first row). So we put winner losses = 1. John has won 2 games of type A, so loser wins = 2. He has lost

    +--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| winner |  loser  | tournament | winner wins | loser wins | winner losses | loser losses | winner win % | loser win % |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  John  |  Steve  |      A     |      0      |      0     |       0       |       0      | 0/(0+0)      | 0/(0+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  Steve |   John  |      B     |      0      |      0     |       0       |       0      | 0/(0+0)      | 0/(0+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  John  | Michael |      A     |      1      |      0     |       0       |       0      | 1/(1+0)      | 0/(0+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  Steve |   John  |      A     |      0      |      2     |       1       |       0      | 0/(0+1)      | 2/(2+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------

Upvotes: 2

Views: 1159

Answers (3)

Valentino
Valentino

Reputation: 7361

This shoud produce the expected result:

def win_los_percent(sdf):
    sdf['winner wins'] = sdf.groupby('winner').cumcount()
    sdf['winner losses'] = [(sdf.loc[0:i, 'loser'] == sdf.loc[i, 'winner']).sum() for i in sdf.index]
    sdf['loser losses'] = sdf.groupby('loser').cumcount()
    sdf['loser wins'] = [(sdf.loc[0:i, 'winner'] == sdf.loc[i, 'loser']).sum() for i in sdf.index]
    sdf['winner win %'] = sdf['winner wins'] / (sdf['winner wins'] + sdf['winner losses'])
    sdf['loser win %'] = sdf['loser wins'] / (sdf['loser wins'] + sdf['loser losses'])
    return sdf

ddf = df.groupby('tournament').apply(win_los_percent)

Using the data provided, ddf is:

  winner    loser tournament  winner wins  winner losses  loser losses  loser wins  winner win %  loser win %
0   John    Steve          A            0              0             0           0           NaN          NaN
1  Steve     John          B            0              0             0           0           NaN          NaN
2   John  Michael          A            1              0             0           0           1.0          NaN
3  Steve     John          A            0              1             0           2           0.0          1.0

pandas groupby is used to group the data of the same tournament, and pass the subdataframe to the win_los_percent function. The return value of this function is returned to build the final dataframe.

For each subset, the function calculates the several columns:

  • sdf['winner wins'] and sdf['loser losses'] are obtained by using cumcount: for each row, this method counts the previous occurrences of the value (the player name) in the grouping column.
  • obtaining sdf['winner losses'] and sdf['loser wins'] is a bit more elaborated, since we need to count the previous occurrence of a value (player name) in another column.
    The list comprehension iterates over the dataframe index to select the previous rows and check if the player name in the column 'winner' is equal to the playeer name at current row in column loser or viceversa. sum() allows to count the True values: True is casted to 1, False to 0, the sum gives the wanted result: how many times the player name is present in previous rows.
  • Percentage columns are obtained with vectorization. Where the result is NaN is because there is a division by 0.

Upvotes: 1

Dev Khadka
Dev Khadka

Reputation: 5451

Here is my attempt

Explanation

  1. Create new dataframe by combining "winner" and "loser" columns to "player" column and indicate win/loose by "won" column
  2. Group new dataframe by "player" and "tournament" and calculate win % using "calc_winning_percent" function for each row
  3. now split win % of new_df into winner win % and loser win % and assign back to df
d = {
    "winner": "John Steve John Steve".split(),
    "loser": "Steve John Michael John".split(),
    "tournament": "A B A A".split()
}

def calc_wining_percent (group):
    group["wining_percent"] = group["won"].cumsum().shift()/np.arange(1, len(group)+1)
    return group

df = pd.DataFrame(d)

new_df =  pd.DataFrame(np.ravel(df[["winner", "loser"]]), columns=["player"])
new_df["won"]= np.r_[np.ones(len(df)), np.zeros(len(df))]
new_df["tournament"] = np.tile(df["tournament"],2)

new_df = new_df.groupby(["player", "tournament"]).apply(calc_wining_percent)

df["winner win %"] = new_df["wining_percent"].values.reshape(-1,2)[:,0]
df["loser win %"] = new_df["wining_percent"].values.reshape(-1,2)[:,1]

display("result", df)

Upvotes: 0

GabCaz
GabCaz

Reputation: 119

The first solution that comes to mind to me is to use object-oriented programming. I am proposing some guidelines of an implementation below.

You can create a player class with attributes firstName, lastName, numberWins, and numberLosses. In the constructor, you can set numberWins and numberLosses to 0. I also recommend adding a wins method and loses method, which will, respectively, increment numberWins and numberLosses by 1.

After each match, you can increment the attributes in accordance with the outcome of the match (e.g. if I lose, my number of losses increases by 1, if I win, my number of wins increases by 1). At any time, you can get the historical win % with player.numberWins / (player.numberWins + player.numberLosses).

If you want to compute this ratio by tournament, you can then define the numberWins and numberLossesattributes to be data frames. where each column represents a tournament, (e.g. column labels are A, B, C...), and each row is a corresponding number of wins. Then, you would also add an argument in the wins and loses methods to specify the tournament at hand. (You could even define a score class for more flexibility)

In your big data frame where you record matches, you can then actually store two player objects (instead of Strings identifying the players, as you do right now) and change their attributes as you go.

Then, there might also be some quick Math sorcery. However, the object-oriented implementation seems particularly appropriate in your case, at first sight (a player is, in the end, an object, with its own attributes and methods...).

Upvotes: 0

Related Questions