Reputation: 7168
Assumptions:
For example, I have a long list of match outcomes that look like this:
Team A beats Team B
Team B beats Team A
Team A beats Team B
Team C beats Team A
Team A beats Team C
Problem:
Predict the correct betting odds of any team beating any other team.
In the example above, maybe we conclude that A should beat B 66% of the time. That is based on direct observation and is pretty straightforward. However, finding the probability that C beats B seems harder. They've never played each other, yet it seems most likely that C > B, with some low confidence.
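The direct-observation estimate can be sketched like this (Python here just for illustration, using the example outcomes above):

```python
# A sketch (Python, using the example outcomes above) of the
# direct-observation estimate: count head-to-head results and divide.
from collections import Counter

matches = [("A", "B"), ("B", "A"), ("A", "B"),   # (winner, loser)
           ("C", "A"), ("A", "C")]

wins = Counter(matches)   # (winner, loser) -> number of such results

def direct_win_rate(x, y):
    """P(x beats y) from direct matches only; None if they never met."""
    played = wins[(x, y)] + wins[(y, x)]
    return wins[(x, y)] / played if played else None

print(direct_win_rate("A", "B"))   # 2 of 3 -> 0.666...
print(direct_win_rate("C", "B"))   # never met -> None
```

The C-versus-B gap is exactly where direct observation runs out, which is what the rest of the question is about.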
Research I've Done:
I've read a fair bit about different ranking systems for games of skill, such as the Elo and Glicko rating systems for chess. These fall short because they make assumptions about the probability distributions involved. For example, Elo's central assumption was that the chess performance of each player in each game is a normally distributed random variable. However, according to Wikipedia, there are other distributions that fit the existing data better.
I don't want to assume a distribution. It seems to me that with 10,000+ match results on hand that I should be able to either deduce the distribution from the evidence (I don't know how to do this), or use some sort of reinforcement learning scheme that doesn't care what the distribution is.
Upvotes: 8
Views: 33466
Reputation: 1230
Ratings Percentage Index - implemented in C# below:
// <copyright file="RPICalculator.cs" company="Steve Stokes Consulting, LLC">
// Copyright © Steve Stokes Consulting, LLC. All Rights Reserved.
// </copyright>
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace RPI.Calculator
{
    public class RPICalculator
    {
        private decimal defaultHomeWinValue = 1.0m;
        private decimal defaultAwayWinValue = 1.0m;
        private decimal homeWinValue = 1.0m;
        private decimal awayWinValue = 1.0m;

        public void Test()
        {
            // Home       Score   Away       Score
            // UConn      64      Kansas     57
            // UConn      82      Duke       68
            // Minnesota  71      UConn      72
            // Kansas     69      UConn      62
            // Duke       81      Minnesota  70
            // Minnesota  52      Kansas     62
            // resulting in the following ratings:
            // UConn:     0.6660
            // Kansas:    0.6378
            // Duke:      0.3840
            // Minnesota: 0.3403
            var Teams = new List<Team>
            {
                new Team { Name = "UConn" },
                new Team { Name = "Kansas" },
                new Team { Name = "Duke" },
                new Team { Name = "Minnesota" },
            };
            Func<string, Team> team = name => Teams.First(t => t.Name == name);
            var Matches = new List<Match>
            {
                new Match { HomeTeam = team("UConn"),     AwayTeam = team("Kansas"),    Winner = team("UConn") },
                new Match { HomeTeam = team("UConn"),     AwayTeam = team("Duke"),      Winner = team("UConn") },
                new Match { HomeTeam = team("Minnesota"), AwayTeam = team("UConn"),     Winner = team("UConn") },
                new Match { HomeTeam = team("Kansas"),    AwayTeam = team("UConn"),     Winner = team("Kansas") },
                new Match { HomeTeam = team("Duke"),      AwayTeam = team("Minnesota"), Winner = team("Duke") },
                new Match { HomeTeam = team("Minnesota"), AwayTeam = team("Kansas"),    Winner = team("Kansas") },
            };
            var results = Calculate(Teams, Matches, Sport.NCAA_Basketball);
            var pad = Teams.Max(t => t.Name.Length);
            foreach (var t in results)
            {
                Debug.WriteLine(string.Format("{0} - {1}", t.Name.PadRight(pad), t.RPI));
            }
        }

        public IEnumerable<TTeam> Calculate<TTeam, TMatch>(IEnumerable<TTeam> Teams, IEnumerable<TMatch> Matches, Sport Sport)
        {
            // TODO: transform a user team to our internal team type
            throw new NotImplementedException();
        }

        /// <summary>
        /// Calculate the RPI of each team.
        /// </summary>
        /// <param name="Teams">The list of teams to calculate RPI for.</param>
        /// <param name="Matches">The list of matches (and their results) the teams played in a period.</param>
        /// <param name="Sport">The sport the teams played - this modifies home/away weighting based on NCAA rules.</param>
        /// <returns>The list of teams with calculated RPIs, best first.</returns>
        public IEnumerable<Team> Calculate(IEnumerable<Team> Teams, IEnumerable<Match> Matches, Sport Sport)
        {
            SetWeightingBasedOnSport(Sport);
            foreach (var team in Teams)
            {
                team.WP = CalculateWinPercent(team, Matches, homeWinValue, awayWinValue);
                team.OWP = CalculateOpponentsWinPercent(team, Matches);
                team.OOWP = CalculateOpponentsOpponentsWinPercent(team, Matches);
                team.RPI = CalculateRPI(team);
            }
            return Teams.OrderByDescending(t => t.RPI);
        }

        private decimal CalculateRPI(Team team)
        {
            // RPI = (WP * 0.25) + (OWP * 0.50) + (OOWP * 0.25)
            // UConn:     0.6660
            // Kansas:    0.6378
            // Duke:      0.3840
            // Minnesota: 0.3403
            var RPI = (team.WP * 0.25m) + (team.OWP * 0.50m) + (team.OOWP * 0.25m);
            return Math.Round(RPI, 4);
        }

        private decimal CalculateOpponentsOpponentsWinPercent(Team teamInQuestion, IEnumerable<Match> Matches)
        {
            // UConn:     ((Kansas 0.6667) + (Kansas 0.6667) + (Duke 0.3333) + (Minnesota 0.3889)) / (4 games) = 0.5139
            // Kansas:    ((UConn 0.7500) + (UConn 0.7500) + (Minnesota 0.3889)) / (3 games) = 0.6296
            // Duke:      ((UConn 0.7500) + (Minnesota 0.3889)) / (2 games) = 0.5694
            // Minnesota: ((UConn 0.7500) + (Duke 0.3333) + (Kansas 0.6667)) / (3 games) = 0.5833
            // get each team I've played this season (not unique)
            var teamsIvePlayed = Matches
                .Where(m => m.AwayTeam == teamInQuestion || m.HomeTeam == teamInQuestion)
                .Select(m => m.HomeTeam == teamInQuestion ? m.AwayTeam : m.HomeTeam);
            // the OOWP is the average OWP of each team I played
            return teamsIvePlayed.Average(t => CalculateOpponentsWinPercent(t, Matches));
        }

        private decimal CalculateOpponentsWinPercent(Team teamInQuestion, IEnumerable<Match> Matches)
        {
            // each opponent's WP with all games against the team in question removed:
            // UConn:     ((Kansas 1.0) + (Kansas 1.0) + (Duke 1.0) + (Minnesota 0.0)) / (4 games) = 0.7500
            // Kansas:    ((UConn 1.0) + (UConn 1.0) + (Minnesota 0.0)) / (3 games) = 0.6667
            // Duke:      ((UConn 0.6667) + (Minnesota 0.0)) / (2 games) = 0.3333
            // Minnesota: ((UConn 0.6667) + (Duke 0.0) + (Kansas 0.5)) / (3 games) = 0.3889
            // get each team I've played this season (not unique)
            var teamsIvePlayed = Matches
                .Where(m => m.AwayTeam == teamInQuestion || m.HomeTeam == teamInQuestion)
                .Select(m => m.HomeTeam == teamInQuestion ? m.AwayTeam : m.HomeTeam);
            // the win percent of each team I played, excluding matches against me (unweighted)
            var matchesWithoutMe = Matches.Where(m => m.AwayTeam != teamInQuestion && m.HomeTeam != teamInQuestion);
            return teamsIvePlayed.Average(t => CalculateWinPercent(t, matchesWithoutMe, defaultHomeWinValue, defaultAwayWinValue));
        }

        private decimal CalculateWinPercent(Team teamInQuestion, IEnumerable<Match> Matches, decimal HomeWinValue, decimal AwayWinValue)
        {
            // the team's win percent - sometimes weighted based on NCAA rules:
            // UConn:     (0.6 + 0.6 + 1.4 + 0) / (0.6 + 0.6 + 1.4 + 1.4) = 0.6500
            // Kansas:    (0 + 0.6 + 1.4) / (1.4 + 0.6 + 1.4) = 0.5882
            // Duke:      (0 + 0.6) / (1.4 + 0.6) = 0.3000
            // Minnesota: (0 + 0 + 0) / (0.6 + 1.4 + 0.6) = 0.0000
            // my wins, summed with weighting
            var wins = Matches
                .Where(m => m.Winner == teamInQuestion)
                .Sum(m => m.HomeTeam == teamInQuestion ? HomeWinValue : AwayWinValue);
            // my games played, weighted
            var gamesPlayed = Matches
                .Where(m => m.HomeTeam == teamInQuestion || m.AwayTeam == teamInQuestion)
                .Sum(m => m.HomeTeam == teamInQuestion ? HomeWinValue : AwayWinValue);
            return wins / gamesPlayed;
        }

        private void SetWeightingBasedOnSport(Sport Sport)
        {
            switch (Sport)
            {
                case Sport.NCAA_Basketball:
                    homeWinValue = 0.6m;
                    awayWinValue = 1.4m;
                    break;
                case Sport.NCAA_Baseball:
                    homeWinValue = 0.7m;
                    awayWinValue = 1.3m;
                    break;
                default:
                    homeWinValue = defaultHomeWinValue;
                    awayWinValue = defaultAwayWinValue;
                    break;
            }
        }
    }

    public enum Sport
    {
        NoHomeOrAwayWeighting = 1,
        NCAA_Basketball = 2,
        NCAA_Baseball = 3,
    }

    public class Team
    {
        public string Name { get; set; }
        public decimal RPI { get; internal set; }
        public decimal WP { get; internal set; }
        public decimal OWP { get; internal set; }
        public decimal OOWP { get; internal set; }
    }

    public class Match
    {
        public Team HomeTeam { get; set; }
        public Team AwayTeam { get; set; }
        public Team Winner { get; set; }
    }
}
Upvotes: 2
Reputation: 31604
We need to make some assumptions, as can be seen from the following example:
Team Rock beats Team Scissors
Team Paper beats Team Rock
Team Rock beats Team Scissors
Now we have a fight between Team Scissors and Team Paper. Since Team Paper has beaten Team Rock which again has beaten team Scissors twice, we might assume that the best odds are on Paper beating Scissors.
However, in the above we have assumed a transitive model, which clearly isn't applicable here. It may be a better fit for some sports like football, but even then not exactly.
What Elo does is assume that every team has some 'inherent strength' on a scale from 0 to infinity. It should be obvious that no real skill works quite like this, but it turns out to be a useful model for predicting games. Also notice that this model doesn't work well for the Rock, Paper, Scissors game above.
The next assumption made in chess is that the absolute difference in 'inherent strength' creates a probability distribution over the likelihood of one player beating another. Again, this is clearly not exactly true, as things like having the white/black pieces also play in. The estimate can however be made more precise by looking at the winning chances over multiple games.
With the above two assumptions we can calculate winning chances, and if the model turns out to be a good fit for the game, they may be reasonably precise. Unfortunately this sort of modelling is not an exact science, and no matter what assumptions you make, you can always find a game/situation where they don't apply.
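The two assumptions above can be sketched in code (Python). A rating gap maps to a win probability through a logistic curve (base 10, scale 400 in standard Elo), and ratings are nudged after each game; K = 32 is a conventional choice, not the only one:

```python
# A sketch of Elo's two assumptions: a rating difference determines a
# win probability, and each result nudges the ratings toward reality.
def elo_expected(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating, expected, actual, k=32):
    """New rating after a game; actual is 1 for a win, 0 for a loss."""
    return rating + k * (actual - expected)

print(elo_expected(1400, 1400))   # equal strength -> 0.5
print(elo_expected(1800, 1400))   # a 400-point gap -> ~0.909
```

Note how nothing here can represent a cycle: if A is rated above B and B above C, the model is forced to predict A over C, which is exactly why it fails for Rock, Paper, Scissors.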
I hope this gave you some inspiration for coming up with more assumptions for your model :) Once you have them, you can test how well they work by seeing whether the model can predict some of the results you already have in your training set. There is a whole world of machine learning and statistics literature out there for you to enjoy :)
Upvotes: 2
Reputation: 26268
This problem can be solved with directed graphs.
Let all the teams be vertices, and let a directed edge from team1 to team2 mean that team1 beat team2.
After that is done, you can divide the graph into strongly connected components and work with each component independently, as they are statistically independent. Or are they? ;-)
The question we have to ask is: what is the chance of team1 beating team2?
This is easy to answer if the teams you're comparing have direct matches between them. In this case, you care only about direct matches; e.g., how many times has team1 beaten team2? You answer the question as:
(team1WinsAgainstTeam2) / (matchesPlayedBetweenThem)
Is this correct? Should the chances decrease for teamA when we know that teamB has played teamX and WON 100 times, but teamX has always beaten teamA? If yes, discard everything I've said in this post :-)
The final algorithm should look something like this:
double getOddsTeam1Winning(int team1, int team2) {
    if (!isInSameConnectedComponent(team1, team2)) {
        // if the two teams are not in the same connected component,
        // we can use a heuristic: compare each team's overall record
        var team1Record = totalWins(team1) - totalLosses(team1);
        var team2Record = totalWins(team2) - totalLosses(team2);
        return team1Record / (double)(team1Record + team2Record);
    }
    if (isDirectMatchBetween(team1, team2)) {
        return team1WinsOverTeam2 /
               (double)totalMatchesPlayedBetweenTeam1AndTeam2;
    }
    List<double> odds = new List<double>();
    foreach (var opponentTeam in teamsThatTeam1HasPlayedWith) {
        var oddsThatOpponentTeamBeatsTeam2 = getOddsTeam1Winning(opponentTeam, team2);
        var oddsThatTeam1BeatsOpponentTeam = getOddsTeam1Winning(team1, opponentTeam);
        // combine them (e.g. multiply along the chain) & push onto the odds list
        // (note: in practice this recursion needs memoization and a cycle guard)
        odds.Add(oddsThatTeam1BeatsOpponentTeam * oddsThatOpponentTeamBeatsTeam2);
    }
    return odds.Average(); // taking the average of the odds?!
}
PS: I put this together in a few minutes, and I'm not entirely sure it's mathematically correct, but I think it will solve at least one instance of the problem you've listed :p
Upvotes: 0
Reputation: 606
Why don't you just use the Ratings Percentage Index from Wikipedia? You can find it explained better there, but as a fast introduction, you use the following formula:
RPI = (WP * 0.25) + (OWP * 0.50) + (OOWP * 0.25)
WP: Winning Percentage: wins / games_played
OWP: Opponents' Winning Percentage: calculated by taking the average of the WPs of each of the team's opponents, with the requirement that all games against the team in question are removed from the calculation
OOWP: Opponents' Opponents' Winning Percentage: the average of each opponent's OWP.
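A minimal unweighted sketch of these definitions (Python; the NCAA home/away weighting is omitted, so the figures differ slightly from the weighted NCAA numbers, and there's no handling of teams with zero games):

```python
# An unweighted sketch of the RPI definitions above; games are
# (winner, loser) pairs.
def win_pct(team, games):
    """WP: wins / games_played."""
    played = [g for g in games if team in g]
    return sum(1 for w, _ in played if w == team) / len(played)

def owp(team, games):
    """OWP: average opponents' WP, with games against `team` removed."""
    opponents = [l if w == team else w for w, l in games if team in (w, l)]
    others = [g for g in games if team not in g]
    return sum(win_pct(o, others) for o in opponents) / len(opponents)

def oowp(team, games):
    """OOWP: average of each opponent's OWP."""
    opponents = [l if w == team else w for w, l in games if team in (w, l)]
    return sum(owp(o, games) for o in opponents) / len(opponents)

def rpi(team, games):
    return 0.25 * win_pct(team, games) + 0.50 * owp(team, games) + 0.25 * oowp(team, games)

games = [("UConn", "Kansas"), ("UConn", "Duke"), ("UConn", "Minnesota"),
         ("Kansas", "UConn"), ("Duke", "Minnesota"), ("Kansas", "Minnesota")]
print(round(rpi("UConn", games), 4))   # 0.691 unweighted (0.6660 weighted)
```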
This problem was also used in the Google Code Jam.
Hope the algorithm can help you.
All the thanks to Wikipedia.
Upvotes: 2
Reputation: 4366
You want to make a best estimate of a probability (or multiple probabilities) and continuously update your estimate as more data become available. That calls for Bayesian inference! Bayesian reasoning is based on the observation that the probability (distribution) of two things, A and B, being the case at the same time is equal to the probability (distribution) of A being the case given that B is the case times the probability that B is the case. In formula form:
P(A,B) = P(A|B)P(B)
and also
P(A,B) = P(B|A)P(A)
and hence
P(A|B)P(B) = P(B|A)P(A)
Take P(B) to the other side and we get the Bayesian update rule:
P(A|B)' = P(B|A)P(A)/P(B)
Usually A stands for whatever variable you are trying to estimate (e.g. "team x beats team y") while B stands for your observations (e.g. the full history of matches won and lost between teams). I wrote the prime (i.e. the apostrophe in P(A|B)') to signify that the left hand of the equation represents an update of your beliefs. To make it concrete, your new estimate of the probability that team x will beat team y, given all observations so far, is the probability of making those observations given your previous estimate, times your previous estimate, divided by the overall probability of seeing the observations you have seen (i.e. given no assumptions about relative strength between teams; one team winning most of the time is less likely than both teams winning about equally often).
The P(A|B)' from the left hand of the current update becomes the new P(A) on the right hand of the next update. You just keep repeating this as more data come in. Typically, in order to be as unbiased as possible, you start with a completely flat distribution for P(A). Over time P(A) will become more and more certain, although the algorithm is fairly well able to deal with sudden changes of the underlying probability that you're trying to estimate (e.g. if team x suddenly becomes much stronger because a new player joins the team).
The good news is that Bayesian inference works well with the beta distribution which ElKamina also mentioned. In fact the two are often combined in artificial intelligence systems that are meant to learn a probability distribution. While the beta distribution in itself is still an assumption, it has the advantage that it can take many forms (including completely flat and extremely spikey), so there's relatively little reason to be concerned that your choice of distribution might be affecting your outcome.
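As a sketch of that combination (Python), the Beta prior is conjugate to win/loss observations, so the whole update rule collapses to incrementing pseudo-counts; a flat prior is Beta(1, 1), and the posterior mean is (wins + 1) / (games + 2):

```python
# A sketch of the conjugate Beta update: each observed match result
# simply increments one pseudo-count of the Beta distribution.
def update(alpha, beta, won):
    """Posterior Beta parameters after one more match result."""
    return (alpha + 1, beta) if won else (alpha, beta + 1)

def estimate(alpha, beta):
    """Posterior mean of P(x beats y)."""
    return alpha / (alpha + beta)

a, b = 1, 1                       # flat prior, Beta(1, 1)
for won in [True, False, True]:   # x beat y in two of three meetings
    a, b = update(a, b, won)
print(estimate(a, b))             # (2 + 1) / (3 + 2) = 0.6
```

Notice how the estimate stays away from the extremes when data are scarce, which is exactly the "low confidence" behaviour the question asks for.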
One piece of bad news is that you still need to make assumptions, apart from the beta distribution. For example, suppose you have the following variables:
A: team x beats team y
B: team y beats team z
C: team x beats team z
and you have observations from direct matches between x and y and from matches between y and z, but not from matches between x and z. A simple (though naive) way to estimate P(C) could be to assume transitivity:
P(C) = P(A)P(B)
Regardless how sophisticated your approach, you'll have to define some kind of structure of probabilities to deal with the gaps as well as the interdependencies in your data. Whatever structure you choose, it will always be an assumption.
Another piece of bad news is that this approach is plain complicated and I cannot give you a full account of how to apply it to your problem. Given that you need a structure of interdependent probabilities (probability of team x beating team y given other distributions involving teams x, y and z), you may want to use a Bayesian network or related analysis (for example a Markov random field or path analysis).
I hope this helps. In any case, feel free to ask for clarifications.
Upvotes: 6
Reputation: 7807
===Step 1===
Let us say two teams A and B played n matches against each other and A won m times. Applying a flat beta prior, the probability of A winning next time is (m+1)/(n+2).
As you can see, if m and n are large numbers, it is roughly equal to m/n.
===Step 2===
In your case I suggest the following strategy.
Let m = mp + md + mr and n = np + nd + nr,
where suffix p means prior, d means direct, and r means indirect.
You can set mp and np to 1 and 2 respectively (assuming a flat prior), or in a more advanced fashion (detailed at the end).
md and nd are the direct wins and games.
mr and nr are calculated with some strategy.
In the end, the probability of A winning is (mp+md+mr)/(np+nd+nr).
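As a sketch with made-up counts (Python): mp/np encode the flat prior, md/nd the direct record, and mr/nr hypothetical indirect evidence from chains, already damped:

```python
# A sketch of the Step 2 estimate: prior, direct and indirect
# pseudo-counts pooled into one ratio. All numbers are illustrative.
def combined_probability(md, nd, mr, nr, mp=1, np=2):
    """P(A beats B) from prior (p), direct (d) and indirect (r) counts."""
    return (mp + md + mr) / (np + nd + nr)

# Two direct wins in three games, plus half a win in one indirect game
# from a chain A > C > B damped with p = 0.5:
print(combined_probability(md=2, nd=3, mr=0.5, nr=1))   # 3.5/6 ≈ 0.583
```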
===Step 3===
How to calculate mr and nr:
You can use some kind of dampening. E.g., if A defeats C and C defeats B, count it as p wins for A against B. For longer chains, use exponential decay.
The optimal value of p can be calculated using cross-validation, where you leave out a certain part of the data and use the p that maximizes the probability of that left-out data. For your particular problem I suggest leaving out the games between a pair, estimating the probability, and comparing it to the actual values.
You can use k*log^2(s/t) as the penalty, where k is the number of games between the left-out pair A and B, s is the predicted and t the actual probability of A winning. You can also use something like KL divergence.
===Step 4===
Revisit setting mp and np.
You need to have lots of repeated matchups between the same teams for this to work out.
For every pair of teams calculate the probability of winning and plot it. If it looks flat, using 1 and 2 as mp and np is fine. Otherwise go through http://en.wikipedia.org/wiki/Beta_distribution and pick the distribution that best matches.
Upvotes: 0
Reputation: 19601
You could try applying the Elo rating system (http://en.wikipedia.org/wiki/Elo_rating_system), mostly on the basis that other people have used it for working out strengths. However, any such system is really based on some sort of probability model of what is really going on, and you would do better to come up with one for your particular game. For instance, for soccer, one approach is to model the number of goals scored by a team as a Poisson process which depends on the strength of their offense and the other side's defense. Once you have a model you can fit it to the data, for instance by maximum likelihood.
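A sketch of that Poisson model (Python; the goal rates here are illustrative, not fitted to any data):

```python
# Each side's goal count is Poisson with its own rate; P(team 1 wins)
# is the probability mass where its score is strictly higher.
from math import exp, factorial

def pois(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def p_win(lam1, lam2, max_goals=20):
    """P(team 1 outscores team 2), truncated at max_goals per side."""
    return sum(pois(i, lam1) * pois(j, lam2)
               for i in range(max_goals + 1)
               for j in range(i))

# With a stronger attack (1.5 goals/game vs. 1.2), team 1 should be
# favoured but far from certain:
print(p_win(1.5, 1.2))
```

Fitting would then mean choosing each team's attack/defense rates to maximize the likelihood of the observed scores.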
As an example of the model making a difference, look at http://en.wikipedia.org/wiki/Nontransitive_dice. This is a simple example of a situation in which A usually beats B, B usually beats C, and C usually beats A, which is not what you would expect from a simple one-dimensional strength system.
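A quick check of the nontransitivity point (Python, using one standard set of faces from the linked article):

```python
# With these faces, each die beats the next with probability 5/9,
# yet A > B > C > A - no one-dimensional strength can capture that.
from fractions import Fraction
from itertools import product

A = [2, 2, 4, 4, 9, 9]
B = [1, 1, 6, 6, 8, 8]
C = [3, 3, 5, 5, 7, 7]

def p_beats(x, y):
    """Exact probability that die x rolls higher than die y."""
    wins = sum(1 for a, b in product(x, y) if a > b)
    return Fraction(wins, len(x) * len(y))

print(p_beats(A, B), p_beats(B, C), p_beats(C, A))   # 5/9 5/9 5/9
```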
Upvotes: 0