HJA24

Reputation: 363

Measure balanceness of a weighted numpy array

I have player A and B who both played against different opponents.

player  opponent  days_ago
A       C         1
A       C         2
A       D         10
A       F         100
A       F         101
A       F         102
A       G         1
B       C         1
B       C         2
B       D         10
B       F         100
B       F         101
B       F         102
B       G         1
B       G         2
B       G         3
B       G         4
B       G         5
B       G         6
B       G         7
B       G         8

First, I want to find the opponent that is the most common one. My definition of "most common" is not the total number of matches but rather the balanced number of matches. If, for example, players 1 and 2 played 99 times and once respectively against player 3, I prefer opponent 4, against whom both players played 49 times.

In order to measure the "balanceness" I write the following function:

import numpy as np
from collections import Counter


def balanceness(array: np.ndarray) -> float:
    counts = Counter(array)
    m = len(counts)
    n = len(array)

    # normalised Shannon entropy: 1.0 when all classes are equally frequent
    H = -sum((cnt / n) * np.log(cnt / n) for cnt in counts.values())

    return H / np.log(m)

This function works as expected:

>>> balanceness(array=np.array([0, 0, 0, 1, 1, 1]))
1.0
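At the other extreme the score drops towards zero. A quick sanity check (the function is repeated here so the snippet runs standalone) on the 99-vs-1 split mentioned above:

```python
import numpy as np
from collections import Counter

def balanceness(array: np.ndarray) -> float:
    counts = Counter(array)
    m = len(counts)
    n = len(array)
    # normalised Shannon entropy of the class frequencies
    H = -sum((cnt / n) * np.log(cnt / n) for cnt in counts.values())
    return H / np.log(m)

# 99 matches for one player, 1 for the other: heavily imbalanced
print(balanceness(np.array([0] * 99 + [1])))  # ~0.0808
```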

If I run the function on the different opponents I see the following results:

opponent  balanceness         n_matches
C         1                   4
D         1                   2
F         1                   6
G         0.5032583347756457  9

Clearly, opponent F is the most common one. However, the matches of A and B against F are relatively old.

How should I incorporate a recency-factor into my calculation to find the "most recent common opponent"?

Edit

After thinking more about it I decided to weight each match using the following function

def weight(days_ago: int, epsilon: float = 0.005) -> float:
    return np.exp(-days_ago * epsilon)
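One way to make the decay rate interpretable is through the half-life it implies: the weight halves every ln(2)/epsilon days, about 139 days for epsilon = 0.005. A small standalone check:

```python
import numpy as np

def weight(days_ago: float, epsilon: float = 0.005) -> float:
    return np.exp(-days_ago * epsilon)

# the weight halves every ln(2)/epsilon days
half_life = np.log(2) / 0.005  # ~138.6 days
print(weight(0.0), weight(half_life))  # -> 1.0 and ~0.5
```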

I sum the weights of all matches against each opponent:

opponent  balanceness         n_matches  weighted_n_matches
C         1                   4          3.9701246258837
D         1                   2          1.90245884900143
F         1                   6          3.62106362790388
G         0.5032583347756457  9          8.81753570603108

Now, opponent C is the "most-recent balanced opponent".
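The weighted_n_matches column can be reproduced with a short pandas groupby over the match table from the question (a sketch; the column name w is mine):

```python
import numpy as np
import pandas as pd

# the match table from the question
df = pd.DataFrame({
    "player":   ["A"] * 7 + ["B"] * 14,
    "opponent": ["C", "C", "D", "F", "F", "F", "G"]
                + ["C", "C", "D", "F", "F", "F"] + ["G"] * 8,
    "days_ago": [1, 2, 10, 100, 101, 102, 1]
                + [1, 2, 10, 100, 101, 102] + list(range(1, 9)),
})
df["w"] = np.exp(-0.005 * df["days_ago"])  # the weight function from the edit

weighted_n = df.groupby("opponent")["w"].sum()
print(weighted_n.round(4))  # C ~3.9701, D ~1.9025, F ~3.6211, G ~8.8175
```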

Nevertheless, this method ignores recency at the player level, because the weights are summed across both players. There could be a scenario where player 1 recently played many matches against player 3, whereas player 2 faced player 3 only in the distant past.

How can we find the opponent that is

  1. the most balanced / equally-distributed between two players
  2. the opponent with the most recent matches against the two players

Upvotes: 6

Views: 248

Answers (2)

kho

Reputation: 1291

Here is a solution based on matrices that calculates all combinations of player and opponent. Since I assume there is no difference between who is listed as player and who as opponent, i.e. ['A', 'C'] is equal to ['C', 'A'], the matrices are symmetric. Furthermore, I have come up with a slightly different approach than the OP's definition of balanceness. In my opinion a more robust and also simpler definition is: "The common opponents of players A and B are those players who have the highest non-zero minimum score against A and B."

Definitions

import sklearn.metrics
import numpy as np
import pandas as pd

def confusion_matrix(df, *args, **kwargs):
    labels = np.unique(df[['player', 'opponent']])
    x = sklearn.metrics.confusion_matrix(df['player'], df['opponent'], *args, labels=labels, **kwargs)
    # symmetrise: ['A', 'C'] counts the same as ['C', 'A']
    return pd.DataFrame(x + x.T, columns=labels, index=labels)

def weight(days_ago: int, epsilon: float = 0.005) -> float:
    return np.exp(-np.abs(epsilon) * days_ago)

def common_player_matrix(df_confusion, fill_value=''):
    """The common opponents of players A and B are defined as the
    players with the highest non-zero minimum score against A and B."""
    df_res = pd.DataFrame(columns=df_confusion.columns, index=df_confusion.index)
    x, y = np.meshgrid(df_confusion.index, df_confusion.columns)

    for i, j in zip(x.flatten(), y.flatten()):
        min_score = df_confusion[[i, j]].min(axis=1)
        max_of_min = list(min_score[(min_score == min_score.max()) & (min_score > 0)].index)
        df_res.loc[i, j] = max_of_min if max_of_min else fill_value
    return df_res

Example

df = pd.DataFrame({
 'player': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
 'opponent': ['C', 'D', 'F', 'F', 'F', 'G', 'C', 'C', 'D', 'F', 'F', 'F', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G'],
 'days_ago': [2, 10, 100, 101, 102, 1, 1, 2, 11, 110, 101, 102, 1, 2, 3, 4, 5, 6, 7, 8]})

# w/o weights, i.e. simply counting the games
df_matrix_n = confusion_matrix(df)
df_matrix_n_common = common_player_matrix(df_matrix_n)
print(df_matrix_n_common)
#      A    B       C       D       F       G
# A  [F]  [F]                                
# B  [F]  [G]                                
# C               [B]  [A, B]     [B]     [B]
# D            [A, B]  [A, B]  [A, B]  [A, B]
# F               [B]  [A, B]  [A, B]     [B]
# G               [B]  [A, B]     [B]     [B]

# including weights - here the OP's weight function
df_matrix_weight = confusion_matrix(df, sample_weight=weight(df['days_ago']))
df_matrix_weight_common = common_player_matrix(df_matrix_weight)
print(df_matrix_weight_common)
#      A    B    C    D    F    G
# A  [F]  [F]                    
# B  [F]  [G]                    
# C            [B]  [A]  [B]  [B]
# D            [A]  [A]  [A]  [A]
# F            [B]  [A]  [A]  [B]
# G            [B]  [A]  [B]  [B]

# the opponent with the most recent matches against the two players
# (note: with raw days as sample_weight, the highest minimum favours the
#  pair whose last meeting is furthest back; use sample_weight=weight(gb_days)
#  to rank by recency instead)
gb_days = df.groupby(['player', 'opponent']).days_ago.min()
df_last_match = confusion_matrix(gb_days.reset_index(), sample_weight=gb_days)
df_last_match_common = common_player_matrix(df_last_match)
print(df_last_match_common)
#      A    B       C       D       F       G
# A  [F]  [F]                                
# B  [F]  [F]                                
# C               [A]     [A]     [A]  [A, B]
# D               [A]     [B]     [B]  [A, B]
# F               [A]     [B]     [B]  [A, B]
# G            [A, B]  [A, B]  [A, B]  [A, B]

players_of_interest = ('A', 'B')
print(f'{players_of_interest} played:') 
print(f' - most against {df_matrix_n_common.loc[players_of_interest]}')
print(f' - recently most against {df_matrix_weight_common.loc[players_of_interest]}')
print(f' - most recently against {df_last_match_common.loc[players_of_interest]}')
# ('A', 'B') played:
#  - most against ['F']
#  - recently most against ['F']
#  - most recently against ['F']

How to read the matrix:

As the matrices are symmetric, row and column can be flipped in the following. The most frequent common opponent of 'A' (row) and 'B' (column), regardless of the weights, is 'F'. The most common opponent of 'C' (row) and 'D' (column) is 'A' with weights; without weights there are two, 'A' and 'B'.

Upvotes: 1

Mercury

Reputation: 4181

First, I think "balanceness" needs to consider how many days ago the matches were played. For example, suppose A and B each played 1 match against C, both 100 days ago. Now let A and B each play 1 match against E, 1 day ago and 199 days ago respectively. Although the number of matches is the same, their recency differs, and the two cases shouldn't get the same balanceness score.

By using the defined weight(days_ago) function, it will be as if A and B both played 0.60 matches against C, while they played 0.995 and 0.36 matches against E respectively. These two scenarios should have different balanceness.

Second, balanceness alone is obviously not enough. If A and B played 1 match each against D, both 100 years ago, and against E, both 200 years ago, the two scenarios are equally "balanced". You need to define a "recency" score (between 0 and 1); I think the average weight might work. You can then combine the two metrics in some way, e.g. B * R, or (B * R)/(B + R), or alpha * B + (1 - alpha) * R.

import numpy as np
import pandas as pd

data = [
    ["A", "C", 2],
    ["A", "D", 10],
    ["A", "F", 100],
    ["A", "F", 101],
    ["A", "F", 102],
    ["A", "G", 1],
    ["B", "C", 1],
    ["B", "C", 2],
    ["B", "D", 10],
    ["B", "F", 100],
    ["B", "F", 101],
    ["B", "F", 102],
    ["B", "G", 1],
    ["B", "G", 2],
    ["B", "G", 3],
    ["B", "G", 4],
    ["B", "G", 5],
    ["B", "G", 6],
    ["B", "G", 7],
    ["B", "G", 8]
]

def weight(days_ago: int, epsilon: float = 0.005) -> float:
    return np.exp(-days_ago * epsilon)

def weighted_balanceness(array: np.ndarray, weights: np.ndarray):
    classes = np.unique(array)
    cnt = np.array([weights[array == c].sum() for c in classes])
    m = len(classes)
    n = weights.sum()

    H = -(cnt / n * np.log(cnt / n)).sum() 
    return H / np.log(m)


df = pd.DataFrame(data=data, columns=["player", "opponent", "days_ago"])
df["effective_count"] = weight(df["days_ago"])

scores = []
for opponent in df["opponent"].unique():
    df_o = df.loc[df["opponent"] == opponent]
    player = np.where(df_o["player"].values == "A", 0, 1)
    balanceness = weighted_balanceness(array=player, weights=df_o["effective_count"])

    recency = df_o["effective_count"].mean()
    scores.append([opponent, balanceness, recency])


df_out = pd.DataFrame(scores, columns=["opponent", "balanceness", "recency"])
df_out["br"] = df_out["balanceness"] * df_out["recency"]
df_out["mean_br"] = 0.5 * df_out["balanceness"] + 0.5 * df_out["recency"]
df_out["harmonic_mean_br"] = df_out["balanceness"] * df_out["recency"] / (df_out["balanceness"] + df_out["recency"])  # NB: half the harmonic mean 2BR/(B+R); the ranking is unaffected

print(df_out)

This gives me the following:

  opponent  balanceness   recency        br   mean_br  harmonic_mean_br
0        C     0.917739  0.991704  0.910125  0.954721          0.476644
1        D     1.000000  0.951229  0.951229  0.975615          0.487503
2        F     1.000000  0.603511  0.603511  0.801755          0.376368
3        G     0.508437  0.979726  0.498129  0.744082          0.334728

Note that D and F have perfect balanceness: each played A and B the same number of times, the same number of days ago. However, F's matches are older (100-102 days ago), so F has a lower recency score, which hurts its combined scores.

Depending on how you combine b and r, most likely D or C would be the best choice (C may win if you give more weight to recency).
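To see where the winner flips between C and D under alpha * B + (1 - alpha) * R, one can sweep alpha using the balanceness and recency values from the output table above (a quick sketch):

```python
# balanceness (B) and recency (R) taken from the output table above
scores = {"C": (0.917739, 0.991704), "D": (1.000000, 0.951229)}

def combined(alpha: float, b: float, r: float) -> float:
    return alpha * b + (1 - alpha) * r

for alpha in (0.2, 0.5, 0.8):
    winner = max(scores, key=lambda o: combined(alpha, *scores[o]))
    print(alpha, winner)
# C wins for alpha below ~0.33 (recency-heavy); D wins above it
```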

Upvotes: 3
