Nefariis

Reputation: 3549

ESports Match Predictor (Machine Learning)

I think I'm starting to get the hang of the basics of machine learning. I've already completed three tutorial projects: a stock price predictor, a wine rating predictor, and a Moneyball (MLB) project. Now I'm looking to do something a little more advanced as a personal project, but I can't seem to find information on how to go about it.

I watch quite a bit of ESports and I found a game that has an extensive API with statistics of every rated match played, and I wanted to see if I could put together a match predictor.

The game is a 3v3 game where users pick an Avatar (about 50 in total) and battle each other on a specific map (about 10 in total). I think the important details for this are the Avatars on each team, the position/role each avatar plays, the map that was played, each player's personal rank, and the final score.

I pulled down 500k matches to train on, and my "match objects" look like this -

{
  "map": "SomeMap",
  "team1score": 1,
  "team2score": 4,
  "team1": [
    {"avatar": "whatever1", "position": "role1", "playerRank": 14},
    {"avatar": "whatever2", "position": "role2", "playerRank": 10},
    {"avatar": "whatever3", "position": "role4", "playerRank": 11}
  ],
  "team2": [
    {"avatar": "whatever4", "position": "role2", "playerRank": 11},
    {"avatar": "whatever5", "position": "role3", "playerRank": 12},
    {"avatar": "whatever6", "position": "role1", "playerRank": 11}
  ]
}

In the object above, there is the map, the team score, the avatars on each team, the role of each avatar, and the person's online rank.

I think where I'm struggling is that there are more dimensions here than in my previous projects. I originally started coding this using a flat table like -

avatar1, avatar1_rank, avatar1_tier, avatar2, avatar2_rank, avatar2_tier, etc 

but I quickly discovered that doing so gave weight to where "on the team" the avatar was (index 1, 2, or 3), when it shouldn't matter where on the team the avatar is, just that he is on the team. Ordering alphabetically doesn't solve this either: with 50 or so avatars to choose from, any given avatar is about as likely to appear in position 1 as in position 3. So essentially I'm looking to analyze two teams of players, where the stats for each player are given and the players can appear in any order (1-3) on their team.
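For reference, the flat row I built looked roughly like this (column and value names are just placeholders from my match object above):

```python
import pandas as pd

# One row per match; a player's slot (1, 2, 3) decides which columns he fills.
# This is exactly what bakes the irrelevant "where on the team" index into the features.
match_row = {
    "map": "SomeMap",
    "t1_avatar1": "whatever1", "t1_avatar1_role": "role1", "t1_avatar1_rank": 14,
    "t1_avatar2": "whatever2", "t1_avatar2_role": "role2", "t1_avatar2_rank": 10,
    "t1_avatar3": "whatever3", "t1_avatar3_role": "role4", "t1_avatar3_rank": 11,
    "t2_avatar1": "whatever4", "t2_avatar1_role": "role2", "t2_avatar1_rank": 11,
    "t2_avatar2": "whatever5", "t2_avatar2_role": "role3", "t2_avatar2_rank": 12,
    "t2_avatar3": "whatever6", "t2_avatar3_role": "role1", "t2_avatar3_rank": 11,
    "team1score": 1, "team2score": 4,
}
df = pd.DataFrame([match_row])
```

Swapping any two players of a team produces a different row here, even though it's the same match.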

So my question is - does anyone have any code or know of any similar sample projects?

Thank you all.

Upvotes: 2

Views: 447

Answers (2)

Alexander L. Hayes

Reputation: 4273

I've very loosely followed some of the "machine learning for League of Legends match prediction" work. The game you're describing sounds similar.

Encoding Problems: Relational vs. Non-Relational

The core problem is that you have relational data, and mainstream machine learning usually operates on non-relational data (or: "a flat table").

This:

"team1": [
    {"avatar": "whatever1", "position": "role1", "playerRank": 14},
    {"avatar": "whatever2", "position": "role2", "playerRank": 10},
    {"avatar": "whatever3", "position": "role4", "playerRank": 11},
]

... implies a JOIN happened behind-the-scenes to select the players that were present in a match.

We first have to transform from a relational representation into a flat feature vector. Some of these can be treated as categorical variables, and a one-hot encoding can represent the presence/absence of an avatar, role, or map:

import pandas as pd
import numpy as np

# Categorical variables
maps = ["Map1", "Map2", "Map3"]
avatars = ["A1", "A2", "A3", "A4", "A5", "A6"]
roles = ["ADC", "Top", "Support", "Mid", "Jungle"]

# Create variables for each team, using "Team 1" and "Team 2"
t1_avatars = ["t1_" + a for a in avatars]
t2_avatars = ["t2_" + a for a in avatars]
t1_roles = ["t1_" + r for r in roles]
t2_roles = ["t2_" + r for r in roles]

# Possible one-hot-encoding for categorical variables
n_columns = sum(len(l) for l in (maps, t1_avatars, t2_avatars, t1_roles, t2_roles))
df = pd.DataFrame(
    np.zeros((1, n_columns)),
    columns=maps + t1_avatars + t2_avatars + t1_roles + t2_roles,
)
   Map1  Map2  Map3  t1_A1  ...  t2_Top  t2_Support  t2_Mid  t2_Jungle
0   0.0   0.0   0.0    0.0  ...     0.0         0.0     0.0        0.0

And for the data here, the non-zero columns might look like this:

   Map1  t1_A1  t1_A2  t1_A3  t2_A4  t2_A5  t2_A6  t1_ADC  t1_Top  t1_Jungle  t2_ADC  t2_Top  t2_Support
0     1      1      1      1      1      1      1       1       1          1       1       1           1
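As a sketch of how one match object could be turned into such a row (column names as defined above; the match values are placeholders), note that we only mark presence, so the order the players arrive in no longer matters:

```python
import pandas as pd

maps = ["Map1", "Map2", "Map3"]
avatars = ["A1", "A2", "A3", "A4", "A5", "A6"]
roles = ["ADC", "Top", "Support", "Mid", "Jungle"]
columns = (
    maps
    + ["t1_" + a for a in avatars] + ["t2_" + a for a in avatars]
    + ["t1_" + r for r in roles] + ["t2_" + r for r in roles]
)

match = {
    "map": "Map1",
    "team1": [
        {"avatar": "A1", "position": "ADC", "playerRank": 14},
        {"avatar": "A2", "position": "Top", "playerRank": 10},
        {"avatar": "A3", "position": "Jungle", "playerRank": 11},
    ],
    "team2": [
        {"avatar": "A4", "position": "ADC", "playerRank": 11},
        {"avatar": "A5", "position": "Top", "playerRank": 12},
        {"avatar": "A6", "position": "Support", "playerRank": 11},
    ],
}

row = dict.fromkeys(columns, 0)
row[match["map"]] = 1
for prefix, team in (("t1_", match["team1"]), ("t2_", match["team2"])):
    for player in team:
        row[prefix + player["avatar"]] = 1    # presence of the avatar on this team
        row[prefix + player["position"]] = 1  # presence of the role on this team

df = pd.DataFrame([row])
```

Shuffling either team's list produces exactly the same row, which is the order-invariance you're after.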

For numeric features you might try simple summary statistics, like the sum of player ranks or average player ranks. Most summary statistics are invariant under permutation (i.e. sum(1, 2, 3) == sum(3, 2, 1)) so the feature is independent of the order you observe players in:

>>> sum(p['playerRank'] for p in data['team1'])
35
>>> sum(p['playerRank'] for p in data['team2'])
34

Existing Approaches

I'll discuss two cases. Neither addresses this problem.

The Medium post describes using numeric features (total team deaths, total team kills, total minions cleared) prior to the 10-minute mark; i.e., it avoids relational features.

The arXiv paper uses feature engineering to estimate "how good a player is at playing a champion." I suspect they ran into this same problem, because in section 3.2 they suggest:

"For each match, we derived additional features to better reflect each team's overall player-champion experience. For both teams, we added the team's average, median, coefficient of excess kurtosis (i.e., how normal the distribution is), coefficient of skewness, standard deviation, and variance of the player's champion win rate and of their champion mastery points. The final training dataset contained 44 features per match (2 features per player for 10 players and 12 features per team)."

That's a little vague. I suspect this paper used an arbitrary order, so the vector of player skills might be:

[1.0, 0.5, 0.3, 1.2, 1.1]

... even though there are 120 valid ways to permute this sequence.

Or they might be assuming players always appear in a fixed order based on the role they played: [Top, Jungle, Mid, Support, Bottom]. There are five roles in LoL, and ranked queue/tournaments often display players in this order.

Side note: I recently answered a question about training neural networks on problems with interchangeable objects. The idea is to permute interchangeable variables at learning time, and sample from possible permutations at prediction time. However: the complexity is factorial in the number of variables.

Does anything handle relational features directly?

It's an open research problem!

Flattening from a relational to a non-relational representation is lossy. If you want to get things working with statistical learning methods: aggregate, aggregate, aggregate. There's evidence that you can get a good approximation with feature engineering and aggregation (e.g. Perlich and Provost 2006).
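For instance (ranks taken from team1 in the match object above), each of these aggregates is invariant under any reordering of the players:

```python
import numpy as np

ranks = np.array([14, 10, 11])  # team1 playerRanks, in whatever order they arrive
features = {
    "rank_sum": ranks.sum(),
    "rank_mean": ranks.mean(),
    "rank_std": ranks.std(),
    "rank_min": ranks.min(),
    "rank_max": ranks.max(),
}
# Computing the same dict from np.array([11, 14, 10]) yields identical values.
```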

For something more specific:

  • An Introduction to Lifted Probabilistic Inference by Guy Van den Broeck, Kristian Kersting, Sriraam Natarajan, and David Poole; is a recent treatment of advances on this topic.
  • I maintain a project called srlearn (for: statistical-relational-learn) that tries to work with relational data directly—but it has some major caveats like not handling continuous features.

Upvotes: 1

mrk

Reputation: 10386

Let's approach this from an architectural perspective. Your main problem is:

but I quickly discovered in doing that, it was giving weight to where "on the team" the avatar was (index 1, index 2, or index 3) when it shouldn't matter where on the team the avatar is, just that he is on the team.

There are two potential solutions for this I can think of right away. Let's assume avatars A, B, C, D, E, ...

Prior Encoding (which I would recommend)

You could encode your teams into a single configuration specific value, which you would then hand over instead.

So you would map (A, B, C), (A, C, B), (B, A, C), (B, C, A), (C, A, B), and (C, B, A) all to, for example, "Team A". This value then carries the information that avatars A, B, and C were on the team. This encoding would have to be done for each possible team; a simple lookup table will do. The encoding could also be delegated to another auxiliary model, or a layer for that matter.
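A minimal sketch of such a lookup table (avatar names are placeholders): using a frozenset as the key makes all 3! orderings of a team collapse to the same entry.

```python
from itertools import combinations

avatars = ["A", "B", "C", "D", "E"]

# One id per unordered 3-avatar team; combinations() enumerates each team exactly once.
team_id = {frozenset(team): i for i, team in enumerate(combinations(avatars, 3))}

# All six orderings of (A, B, C) map to the same encoded value.
assert team_id[frozenset(("C", "A", "B"))] == team_id[frozenset(("A", "B", "C"))]
```

With 50 avatars the table has C(50, 3) = 19,600 entries, which is still perfectly manageable as a dict.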

Then you can just go ahead and train your model as usual.

Augmentation Interpretation

You could also take the stance that your model might as well just learn on its own that position does not influence the effect of a given avatar. To support this learning, you would feed the model samples with every possible ordering of each team (see the permutations above) during training, while keeping all other input values and the ground truth constant. This increases the training effort, does not guarantee the model will do the job as well as you would like, and is a pain for your data pipeline.
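A sketch of that augmentation, assuming a small helper that emits one copy of the sample per ordering of each team, with the label held constant:

```python
from itertools import permutations

def augment(team1, team2, label):
    """Yield every ordering of both teams, keeping the label (ground truth) fixed."""
    for p1 in permutations(team1):
        for p2 in permutations(team2):
            yield list(p1), list(p2), label

# 3! * 3! = 36 augmented copies of a single match
samples = list(augment(["A", "B", "C"], ["D", "E", "F"], label=1))
```

Note the blow-up is factorial per team, so this stays cheap for 3v3 but gets expensive fast for larger teams.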

Upvotes: 2
