How to measure Bayesian uncertainty of a value based on another value in PyMC3

Question

My data looks something like this:

Player | Avg_goals  | Minutes played

A      | 10         | 100
B      | 12.1       | 900
C      | 15         | 1600
D      | 8.3        | 3200
E      | 3          | 750
...
Z      | 2.4        | 420

What I want to do

I want to model and estimate the true avg_goals for each player, where the source of uncertainty is the Minutes value.

That is, I'm more certain that the true rate is close to the recorded Avg_goals when Minutes is high. For players with fewer minutes there's greater variance, and so greater uncertainty about whether their true rate is close to the value in the Avg_goals column. So I'm more sure that Player D's rate is close to 8.3 than that Player Z's is close to 2.4, because D has played far more minutes.

The Issue

I'm unsure how to express this relationship between Avg_goals and Minutes in a PyMC3 model. I've decided to use a Poisson distribution for the goals column, but after that I have no idea how to proceed. My (incomplete) code so far is:

import pymc3 as pm
import numpy as np

minutes = np.array([100, 900, 1600, 3200, 750])
goals_ = np.array([10, 12.1, 15, 8.3, 3])

with pm.Model() as model:
    # `lambda` is a reserved word in Python, so the variable is named `lam`
    lam = pm.Normal('lambda', goals_.mean())
    goals_ = pm.Poisson('goals_', lam)
    
    ###NO IDEA WHAT COMES NEXT??!!

Any help would be appreciated. If I can get similar examples implemented in PyMC3, that would be great.

merv · Accepted Answer

Without actual replicates, how uncertainty scales with time will be arbitrary. However, you are correct that one can at least observe relative differences in uncertainty. The absolute scale of that uncertainty will simply depend on the choices made in the priors.

For the sake of model simplicity, I would suggest working with minutes_played and total_goals rather than the variables in the OP. We can still produce a goal_rate output, but it will be a deterministic function of model parameters. Note that some of the entries in the OP's data don't make sense. For example, there is no whole number of goals that a player could have scored in 1600 minutes that would give a rate of 15 goals per 90 minutes. I therefore changed the data to make it valid.
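The validity check described above can be sketched in a few lines of NumPy. This is only an illustration: it assumes, as this answer does, that Avg_goals in the question is a rate per 90 minutes, and the variable names (implied_totals, is_valid) are made up for the example.

```python
import numpy as np

# Data from the question; Avg_goals is ASSUMED to be goals per 90 minutes
minutes = np.array([100, 900, 1600, 3200, 750])
avg_goals = np.array([10, 12.1, 15, 8.3, 3])

# Total goals implied by each (rate, minutes) pair
implied_totals = avg_goals * minutes / 90

# An entry is only self-consistent if the implied total is a whole number
is_valid = np.isclose(implied_totals, np.round(implied_totals))

for m, g, t, ok in zip(minutes, avg_goals, implied_totals, is_valid):
    print(f"{int(m)} min at {float(g)} per 90 -> {float(t):.2f} goals, valid: {bool(ok)}")
```

Only the 900-minute and 750-minute entries come out as whole numbers of goals, which is why the data needs adjusting before a count-based likelihood is used.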

Binomial Regression Model

As a first stab, I'll propose a binomial regression model, where we model the total_goals a player has made as a binomial random variable with N corresponding to the minutes_played and the rate being a player-specific goals per minute. The regression part is that we will assume all players have a shared mean goal rate, and we will infer a player-specific coefficient that defines their deviation from the mean.

This isn't an exact model. For example, it assumes that all players have a rate of scoring between 0 and 1 goals per minute (GPM). While there is nothing logically impossible about a GPM of over 1, practically it is implausible, so I think this model is not unreasonable.
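To see why a binomial likelihood makes uncertainty shrink as minutes grow, a rough non-Bayesian approximation helps: the standard error of an estimated proportion p from n trials is sqrt(p(1-p)/n). The sketch below applies that formula to the minutes used in this answer, with the per-90 rates converted to whole goals by hand. It is a back-of-the-envelope illustration under the "one trial per minute" assumption, not the posterior that the model produces.

```python
import numpy as np

# Minutes and whole-goal totals (rates of 18, 10, 3, 12.1, 15, 8.3 per 90 minutes)
minutes_played = np.array([10, 90, 750, 900, 1800, 3600])
total_goals = np.array([2, 10, 25, 121, 300, 332])

# Observed goals per minute, treating each minute as one binomial trial
p_hat = total_goals / minutes_played

# Approximate standard error of the per-minute rate, then rescaled to per 90 minutes
se_gpm = np.sqrt(p_hat * (1 - p_hat) / minutes_played)
rate_per_90 = p_hat * 90
se_per_90 = se_gpm * 90

for n, r, se in zip(minutes_played, rate_per_90, se_per_90):
    print(f"{int(n)} min: {float(r):.1f} +/- {float(se):.2f} goals per 90")
```

The error shrinks roughly with the square root of minutes played, so the 10-minute player has a far wider band than the 3600-minute player.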

Data

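import numpy as np
import pymc3 as pm

minutes_played = np.array([10, 90, 750, 900, 1800, 3600])
goals_per_game = np.array([18, 10, 3, 12.1, 15, 8.3])  # goals per 90 minutes
# round to integers to strip floating-point error from the conversion
total_goals = np.round(goals_per_game * minutes_played / 90).astype(int)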

n_players = len(total_goals)

Model

with pm.Model() as model:
    # regression model coefficients
    c_player = pm.Normal('c_player', 0, tau=1, shape=n_players)
    c_mu = pm.Normal('c_mu', 0, 10)

    # goals per minute (by player)
    gpm = pm.math.invlogit(c_mu + c_player)

    # intuitive variables
    pm.Deterministic('avg_goal_rate', pm.math.invlogit(c_mu)*90)
    pm.Deterministic('goal_rate', gpm*90)

    # log-likelihood
    pm.Binomial('llik', n=minutes_played, p=gpm, observed=total_goals)

    trace = pm.sample()
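The model above works on the logit scale and converts back to a per-minute probability with the inverse logit. The NumPy sketch below shows that mapping in isolation. The values of c_mu and c_player are hypothetical, picked only to illustrate the transform, and invlogit here is a plain stand-in for the logistic function rather than PyMC3's own implementation.

```python
import numpy as np

def invlogit(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# HYPOTHETICAL coefficient values, not estimates from the model
c_mu = -2.0                            # shared mean on the logit scale
c_player = np.array([-1.0, 0.0, 1.0])  # player deviations from the mean

# Goals per minute for each player, then rescaled to goals per 90 minutes
gpm = invlogit(c_mu + c_player)
goal_rate = gpm * 90
avg_goal_rate = invlogit(c_mu) * 90

print(goal_rate, avg_goal_rate)
```

A player with c_player equal to 0 sits exactly at the average rate, and equal steps on the logit scale give unequal steps in goals per 90.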

Results

This seems to sample without much issue.

and the posterior distributions on the "intuitive" variables generally reflect what one expects, namely, higher uncertainty on players with fewer minutes:

pm.plot_forest(trace, var_names=['avg_goal_rate', 'goal_rate'])

[Figure: forest plot of the avg_goal_rate and goal_rate posteriors]

Scaling Uncertainty

In this model, the parameter that regulates the scale of the uncertainty is the tau argument (the precision, i.e. 1/sigma^2) in c_player = pm.Normal('c_player', 0, tau=1, shape=n_players). In terms of the model, this precision controls how plausible it is that a player truly deviates from the average goal rate; higher precisions imply lower plausibility. I suggest playing around with this value (e.g., 0.1, 10) to see how it changes the uncertainty around each player's goal rate.
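Before re-running the sampler, it can help to see what a given tau means on the goal-rate scale. The sketch below converts each precision to a standard deviation (sigma = 1/sqrt(tau)) and shows the range of goals per 90 minutes covered by plus or minus two prior standard deviations around a fixed average. The average of 0.1 goals per minute is a hypothetical value chosen for illustration, and the calculation looks only at the prior, not at the posterior.

```python
import numpy as np

def invlogit(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# HYPOTHETICAL shared mean: logit of 0.1 goals per minute (9 goals per 90)
c_mu = np.log(0.1 / 0.9)

taus = np.array([0.1, 1.0, 10.0])
sigmas = 1.0 / np.sqrt(taus)  # precision -> standard deviation

# Goal rates (per 90 minutes) at -2 and +2 prior standard deviations
low = invlogit(c_mu - 2 * sigmas) * 90
high = invlogit(c_mu + 2 * sigmas) * 90
widths = high - low

for tau, s, lo_, hi_ in zip(taus, sigmas, low, high):
    print(f"tau={float(tau)}: sigma={float(s):.2f}, range {float(lo_):.1f} to {float(hi_):.1f} per 90")
```

Smaller tau values let a player's rate wander almost anywhere between 0 and 90 goals per 90 minutes a priori, while larger values pull every player toward the shared average.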