Reputation: 615
My data looks something like this:
Player | Avg_goals | Minutes played
A | 10 | 100
B | 12.1 | 900
C | 15 | 1600
D | 8.3 | 3200
E | 3 | 750
...
Z | 2.4 | 420
My goal is to model and estimate the true avg_goals for any player, where the source of uncertainty is the Minutes values.
As in, I'm more certain that the true rate is close to the recorded Avg_goals when Minutes is high, whereas for players with fewer minutes there is greater variance, and therefore greater uncertainty about whether their true rate is close to the value in the Avg_goals column. So I'm more sure that Player D's rate is close to 8.3 than that Player Z's rate is close to its recorded value of 2.4, because of the number of minutes.
I'm unsure how to express this relationship between Avg_goals and Minutes in a PyMC3 model.
I've decided to use a Poisson prior for the goals column, but after that I have no idea how to proceed. My (incomplete) code so far is:
import pymc3 as pm
import numpy as np
minutes = np.array([100, 900, 1600, 3200, 750])
goals_ = np.array([10, 12.1, 15, 8.3, 3])
with pm.Model() as model:
    # `lambda` is a reserved word in Python, so the variable is named `lam`
    lam = pm.Normal('lambda', goals_.mean())
    goals_ = pm.Poisson('goals_', lam)
    ###NO IDEA WHAT COMES NEXT??!!
Any help would be appreciated. If I can get similar examples implemented in PyMC3, that would be great.
Upvotes: 1
Views: 141
Reputation: 76760
Without actual replicates, how uncertainty scales with time will be arbitrary. However, you are correct that one can at least observe relative relationships of uncertainty; the absolute scale of that uncertainty will depend on the choices made in the priors.
For the sake of model simplicity, I would suggest working in minutes_played and total_goals rather than the variables in the OP. We can still have a goal_rate output, but it will be a deterministic function of the model parameters. It should be noted that some of the entries in the OP's data don't make sense. For example, there is no whole number of goals that a player could have scored in 1600 minutes that would give a rate of 15 goals per 90 minutes (15 * 1600 / 90 is about 266.7). Therefore, I changed the data to make it valid.
As a first stab, I'll propose a binomial regression model, where we model the total_goals a player has scored as a binomial random variable, with N corresponding to minutes_played and the success probability being a player-specific goals-per-minute rate. The regression part is that we assume all players share a mean goal rate, and we infer a player-specific coefficient that defines their deviation from that mean.
This isn't an exact model. For example, it assumes that all players have a rate of scoring between 0 and 1 goals per minute (GPM). While there is nothing logically impossible about a GPM of over 1, practically it is implausible, so I think this model is not unreasonable.
import numpy as np
import pymc3 as pm
minutes_played = np.array([10, 90, 750, 900, 1800, 3600])
goals_per_game = np.array([18, 10, 3, 12.1, 15, 8.3])
# round away floating-point error so the observed counts are exact integers
total_goals = np.round(goals_per_game * minutes_played / 90).astype(int)
n_players = len(total_goals)
with pm.Model() as model:
    # regression model coefficients
    c_player = pm.Normal('c_player', 0, tau=1, shape=n_players)
    c_mu = pm.Normal('c_mu', 0, 10)
    # goals per minute (by player)
    gpm = pm.math.invlogit(c_mu + c_player)
    # intuitive variables
    pm.Deterministic('avg_goal_rate', pm.math.invlogit(c_mu)*90)
    pm.Deterministic('goal_rate', gpm*90)
    # likelihood
    pm.Binomial('llik', n=minutes_played, p=gpm, observed=total_goals)
    trace = pm.sample()
This seems to sample without much issue, and the posterior distributions of the "intuitive" variables generally reflect what one expects, namely higher uncertainty for players with fewer minutes:
pm.plot_forest(trace, var_names=['avg_goal_rate', 'goal_rate'])
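The wider intervals for players with fewer minutes follow from the binomial likelihood itself. As a rough sketch that ignores the priors entirely, the sampling spread of an observed per-90 rate can be computed directly; the per-minute rate `p = 0.1` and the function name `rate_std_per_90` are illustrative assumptions, not values taken from the fitted model.

```python
import math


def rate_std_per_90(p, minutes_played):
    """Approximate standard deviation of the observed goals per 90 minutes
    when total goals ~ Binomial(n=minutes_played, p=p)."""
    # Var(total / n) = p * (1 - p) / n; multiply the per-minute rate by 90.
    return 90 * math.sqrt(p * (1 - p) / minutes_played)


p = 0.1  # assumed goals per minute, i.e. 9 goals per 90 minutes
for minutes in (10, 90, 900, 3600):
    print(minutes, round(rate_std_per_90(p, minutes), 2))
```

The spread shrinks with the square root of the minutes played, so quadrupling a player's minutes roughly halves it.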
In this model, the parameter that regulates the scale of the uncertainty is the tau argument in c_player = pm.Normal('c_player', 0, tau=1, shape=n_players). In terms of the model, this precision modulates how plausible it is that a player truly deviates from the average goal rate; higher precisions imply lower plausibility. I suggest playing around with this value (e.g., 0.1, 10) to see how it changes the uncertainty around each player's goal rate.
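To get a feel for what a given tau means before refitting, one can translate it into the range of goal rates the prior on c_player considers plausible. The sketch below looks at the prior only, holds c_mu fixed at an assumed value of -2.2 (about 9 goals per 90 minutes), and uses a hypothetical helper `prior_rate_interval`; it relies on the standard relation sd = 1 / sqrt(tau).

```python
import math


def invlogit(x):
    """Inverse logit (logistic) function."""
    return 1 / (1 + math.exp(-x))


def prior_rate_interval(c_mu, tau, n_sd=2):
    """Goals-per-90 range implied by c_mu +/- n_sd prior standard deviations
    of c_player, where sd = 1 / sqrt(tau)."""
    sd = 1 / math.sqrt(tau)
    low = invlogit(c_mu - n_sd * sd) * 90
    high = invlogit(c_mu + n_sd * sd) * 90
    return low, high


c_mu = -2.2  # assumed value, roughly 9 goals per 90 minutes on average
for tau in (0.1, 1, 10):
    low, high = prior_rate_interval(c_mu, tau)
    print(tau, round(low, 2), round(high, 2))
```

Larger tau values give a narrower range, meaning players with few minutes are pulled more strongly toward the shared average.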
Upvotes: 1