Reputation: 363
I have players A and B, who both played against different opponents:
player | opponent | days ago |
---|---|---|
A | C | 1 |
A | C | 2 |
A | D | 10 |
A | F | 100 |
A | F | 101 |
A | F | 102 |
A | G | 1 |
B | C | 1 |
B | C | 2 |
B | D | 10 |
B | F | 100 |
B | F | 101 |
B | F | 102 |
B | G | 1 |
B | G | 2 |
B | G | 3 |
B | G | 4 |
B | G | 5 |
B | G | 6 |
B | G | 7 |
B | G | 8 |
First, I want to find the opponent that is the most common one. My definition of "most common" is based not on the total number of matches, but on how evenly the matches are split between the two players.
If, for example, players 1 and 2 played 99 times and 1 time respectively against player 3, I would prefer an opponent 4 against whom both players played 49 times.
In order to measure the "balanceness" I write the following function:

```python
import numpy as np
from collections import Counter

def balanceness(array: np.ndarray) -> float:
    # normalized Shannon entropy of the class distribution
    counts = Counter(array)
    m = len(counts)
    n = len(array)
    H = -sum((cnt / n) * np.log(cnt / n) for cnt in counts.values())
    return H / np.log(m)
```
This function works as expected:

```python
>>> balanceness(array=np.array([0, 0, 0, 1, 1, 1]))
1.0
```
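For reference, a lopsided split scores well below 1. A quick sanity check (same function, illustrative inputs not from the post):

```python
import numpy as np
from collections import Counter

def balanceness(array: np.ndarray) -> float:
    # normalized Shannon entropy of the class distribution
    counts = Counter(array)
    n = len(array)
    H = -sum((cnt / n) * np.log(cnt / n) for cnt in counts.values())
    return H / np.log(len(counts))

print(balanceness(np.array([0, 0, 0, 1])))       # ≈ 0.811 for a 3-to-1 split
print(balanceness(np.array([0] * 99 + [1])))     # ≈ 0.081 for a 99-to-1 split
```

The score decays toward 0 as the split becomes more one-sided, which is exactly the behaviour wanted in the 99-vs-1 example above.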
If I run the function on the different opponents I see the following results:
opponent | balanceness | n_matches |
---|---|---|
C | 1 | 4 |
D | 1 | 2 |
F | 1 | 6 |
G | 0.5032583347756457 | 9 |
Clearly, opponent F is the most common one. However, the matches of A and B against F are relatively old. How should I incorporate a recency factor into my calculation to find the "most recent common opponent"?
Edit
After thinking more about it, I decided to weight each match using the following function:

```python
def weight(days_ago: int, epsilon: float = 0.005) -> float:
    return np.exp(-1 * days_ago * epsilon)
```
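With the default epsilon of 0.005, a match from yesterday keeps nearly its full weight, while older matches decay exponentially (illustrative values):

```python
import numpy as np

def weight(days_ago: int, epsilon: float = 0.005) -> float:
    return np.exp(-1 * days_ago * epsilon)

print(weight(1))    # ≈ 0.995
print(weight(100))  # ≈ 0.607
print(weight(365))  # ≈ 0.161
```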
I sum the weights of all the matches against each opponent:
opponent | balanceness | n_matches | weighted_n_matches |
---|---|---|---|
C | 1 | 4 | 3.9701246258837 |
D | 1 | 2 | 1.90245884900143 |
F | 1 | 6 | 3.62106362790388 |
G | 0.5032583347756457 | 9 | 8.81753570603108 |
Now, opponent C is the "most-recent balanced opponent". Nevertheless, this method ignores "recentness" at the player level because we sum the values: there could be a scenario where player 1 recently played a lot of matches against player 3, whereas player 2 faced player 3 only in the distant past. How can we find the opponent that is both balanced and recent for each player individually?
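One possible direction (a hypothetical sketch, not from the post): aggregate the weighted match counts per (player, opponent) pair first, and only then take the minimum across the two players, so that one player's stale history caps the opponent's score:

```python
import numpy as np
import pandas as pd

# hypothetical data: A met C very recently, B met C only long ago
df = pd.DataFrame({'player':   ['A', 'A', 'B'],
                   'opponent': ['C', 'C', 'C'],
                   'days_ago': [1, 2, 400]})
df['w'] = np.exp(-0.005 * df['days_ago'])

# weighted match total per (opponent, player), then the weaker player's total decides
per_player = df.groupby(['opponent', 'player'])['w'].sum()
score = per_player.groupby('opponent').min()
print(score)  # C scores only ~0.135 despite A's two fresh matches
```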
Upvotes: 6
Views: 248
Reputation: 1291
Here is a solution based on matrices to calculate all combinations of players, i.e. player and opponent. Since I assume that there is no difference between who is listed as player and who as opponent, i.e. ['A', 'C'] is equal to ['C', 'A'], the matrices are symmetric. Furthermore, I have come up with a slightly different approach than the OP's balanceness. In my opinion a more robust and also simpler definition is: "The common players of players A and B are those who have the highest non-zero minimum score against players A and B".
```python
import sklearn.metrics
import numpy as np
import pandas as pd

def confusion_matrix(df, *args, **kwargs):
    labels = np.unique(df[['player', 'opponent']])
    x = sklearn.metrics.confusion_matrix(
        df['player'], df['opponent'], labels=labels, *args, **kwargs)
    return pd.DataFrame(x + x.T, columns=labels, index=labels)

def weight(days_ago: int, epsilon: float = 0.005) -> float:
    return np.exp(-np.abs(epsilon) * days_ago)

def common_player_matrix(df_confusion, fill_values=''):
    """The common players of the players A and B are defined as the
    players which have the highest (non-zero) minimum score."""
    df_res = pd.DataFrame(columns=df_confusion.columns, index=df_confusion.index)
    x, y = np.meshgrid(df_confusion.index, df_confusion.columns)
    for i, j in zip(x.flatten(), y.flatten()):
        min_score = df_confusion[[i, j]].min(axis=1)
        max_of_min = list(min_score[(min_score == min_score.max()) & (min_score > 0)].index)
        df_res.loc[i, j] = max_of_min if max_of_min != [] else fill_values
    return df_res

df = pd.DataFrame({
    'player':   ['A', 'A', 'A', 'A', 'A', 'A',
                 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'opponent': ['C', 'D', 'F', 'F', 'F', 'G',
                 'C', 'C', 'D', 'F', 'F', 'F', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G'],
    'days_ago': [2, 10, 100, 101, 102, 1,
                 1, 2, 11, 110, 101, 102, 1, 2, 3, 4, 5, 6, 7, 8]})

# w/o weights, i.e. simply counting the games
df_matrix_n = confusion_matrix(df)
df_matrix_n_common = common_player_matrix(df_matrix_n)
print(df_matrix_n_common)
#        A    B       C       D       F       G
# A    [F]  [F]
# B    [F]  [G]
# C              [B]  [A, B]     [B]     [B]
# D           [A, B]  [A, B]  [A, B]  [A, B]
# F              [B]  [A, B]  [A, B]     [B]
# G              [B]  [A, B]     [B]     [B]

# including weights - here the OP's weight function
df_matrix_weight = confusion_matrix(df, sample_weight=weight(df['days_ago']))
df_matrix_weight_common = common_player_matrix(df_matrix_weight)
print(df_matrix_weight_common)
#        A    B    C    D    F    G
# A    [F]  [F]
# B    [F]  [G]
# C             [B]  [A]  [B]  [B]
# D             [A]  [A]  [A]  [A]
# F             [B]  [A]  [A]  [B]
# G             [B]  [A]  [B]  [B]

# the opponent with the most recent matches against the two players
gb_days = df.groupby(['player', 'opponent']).days_ago.min()
df_last_match = confusion_matrix(gb_days.reset_index(), sample_weight=gb_days)
df_last_match_common = common_player_matrix(df_last_match)
print(df_last_match_common)
#        A    B       C       D       F       G
# A    [F]  [F]
# B    [F]  [F]
# C              [A]     [A]     [A]  [A, B]
# D              [A]     [B]     [B]  [A, B]
# F              [A]     [B]     [B]  [A, B]
# G           [A, B]  [A, B]  [A, B]  [A, B]

players_of_interest = ('A', 'B')
print(f'{players_of_interest} played:')
print(f' - most against {df_matrix_n_common.loc[players_of_interest]}')
print(f' - recently most against {df_matrix_weight_common.loc[players_of_interest]}')
print(f' - most recently against {df_last_match_common.loc[players_of_interest]}')
# ('A', 'B') played:
#  - most against ['F']
#  - recently most against ['F']
#  - most recently against ['F']
```
As the matrices are symmetric, column and index can be flipped. The most frequent common player of 'A' (index) and 'B' (column), regardless of the weights, is 'F'. With weights, the most common player of 'C' (index) and 'D' (column) is 'A'; without weights there are two, 'A' and 'B'.
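The "highest non-zero minimum" rule itself can be shown in isolation (toy counts, not the tables above):

```python
import pandas as pd

# made-up match counts of each candidate opponent against players A and B
scores = pd.DataFrame({'A': [2, 1, 3, 1],
                       'B': [2, 1, 3, 8]},
                      index=['C', 'D', 'F', 'G'])

min_score = scores.min(axis=1)           # the worst of the two counts per opponent
best = min_score[min_score > 0].idxmax()
print(best)  # 'F': its minimum count (3) is the highest non-zero minimum
```

A heavily one-sided opponent like 'G' (1 vs 8) is punished because only its smaller count enters the comparison.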
Upvotes: 1
Reputation: 4181
First, I think "balanceness" needs to consider how many days ago the matches were played. For example, suppose A and B played 1 match against C, both 100 days ago. Again, let A and B both play 1 match against E, 1 day and 199 days ago respectively. Although the number of matches is the same, their recency is different, and they shouldn't have the same balanceness score.
By using the defined weight(days_ago) function, it will be as if A and B both played 0.60 matches against C, while they played 0.995 and 0.37 matches against E respectively. These two scenarios should have different balanceness.
Second, balanceness alone is obviously not enough. If A and B played 1 match each against D, both 100 years ago, and against E, both 200 years ago, then both scenarios are equally "balanced". You need to define a "recency" score (between 0 and 1); I think the average weight might work. Then you can combine the two metrics in some way, e.g. B * R, or (B * R) / (B + R), or alpha * B + (1 - alpha) * R.
```python
import numpy as np
import pandas as pd

data = [
    ["A", "C", 2], ["A", "D", 10], ["A", "F", 100], ["A", "F", 101],
    ["A", "F", 102], ["A", "G", 1],
    ["B", "C", 1], ["B", "C", 2], ["B", "D", 10], ["B", "F", 100],
    ["B", "F", 101], ["B", "F", 102], ["B", "G", 1], ["B", "G", 2],
    ["B", "G", 3], ["B", "G", 4], ["B", "G", 5], ["B", "G", 6],
    ["B", "G", 7], ["B", "G", 8],
]

def weight(days_ago: int, epsilon: float = 0.005) -> float:
    return np.exp(-1 * days_ago * epsilon)

def weighted_balanceness(array: np.ndarray, weights: np.ndarray) -> float:
    classes = np.unique(array)
    cnt = np.array([weights[array == c].sum() for c in classes])
    m = len(classes)
    n = weights.sum()
    H = -(cnt / n * np.log(cnt / n)).sum()
    return H / np.log(m)

df = pd.DataFrame(data=data, columns=["player", "opponent", "days_ago"])
df["effective_count"] = weight(df["days_ago"])

scores = []
for opponent in df["opponent"].unique():
    df_o = df.loc[df["opponent"] == opponent]
    player = np.where(df_o["player"].values == "A", 0, 1)
    balanceness = weighted_balanceness(array=player, weights=df_o["effective_count"])
    recency = df_o["effective_count"].mean()
    scores.append([opponent, balanceness, recency])

df_out = pd.DataFrame(scores, columns=["opponent", "balanceness", "recency"])
df_out["br"] = df_out["balanceness"] * df_out["recency"]
df_out["mean_br"] = 0.5 * df_out["balanceness"] + 0.5 * df_out["recency"]
# B * R / (B + R) as proposed above; note this is half the true harmonic
# mean 2BR / (B + R), but the ranking it produces is the same
df_out["harmonic_mean_br"] = df_out["balanceness"] * df_out["recency"] / (df_out["balanceness"] + df_out["recency"])
print(df_out)
```
This gives me the following:
```
  opponent  balanceness   recency        br   mean_br  harmonic_mean_br
0        C     0.917739  0.991704  0.910125  0.954721          0.476644
1        D     1.000000  0.951229  0.951229  0.975615          0.487503
2        F     1.000000  0.603511  0.603511  0.801755          0.376368
3        G     0.508437  0.979726  0.498129  0.744082          0.334728
```
Note that D and F have perfect balanceness: both played A and B the same number of times, the same number of days ago. However, F's matches were a while back (100-102 days ago), so it has a lower recency score, which hurts its combined scores.
Depending on how you combine B and R, most likely D or C would be the best choice (C may win if you give more weight to recency).
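To illustrate that last point, sweeping alpha in alpha * B + (1 - alpha) * R with the balanceness and recency values from the table above flips the winner from D to C once recency dominates:

```python
import numpy as np

opponents   = np.array(['C', 'D', 'F', 'G'])
balanceness = np.array([0.917739, 1.0, 1.0, 0.508437])   # B column from the table
recency     = np.array([0.991704, 0.951229, 0.603511, 0.979726])  # R column

for alpha in (0.7, 0.5, 0.3):
    score = alpha * balanceness + (1 - alpha) * recency
    print(alpha, opponents[np.argmax(score)])
# prints: 0.7 D / 0.5 D / 0.3 C
```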
Upvotes: 3