Reputation: 59
I was working on a simple Bayesian linear regression using PyMC3 in python. In defining the likelihood function I came across this syntax.
likelihood = pm.Normal('Y', mu=intercept + x_coeff * df['x'], sd=sigma, observed=df['y'])
In the parameters for pm.Normal(), what does the "observed = " do? Please explain with examples if possible.
Upvotes: 1
Views: 3044
Reputation: 116
Let's say you specify a probabilistic model using a set of distributions and their relations to each other. In the case below we assume the random variable y follows a normal distribution with a mean we are unsure about; we believe the mean is positive and that values close to 0 are more likely than larger values. We model this prior belief with a half normal distribution.
import numpy as np
import pymc3 as pm
import arviz as az

# specifying the model
# all random variables are unobserved
model = pm.Model()
with model:
    mean = pm.HalfNormal("mean", sigma=1)
    y = pm.Normal("y", mu=mean, sigma=1)
Let's now draw random samples from this model and visualise them:
with model:
    tr = pm.sample(1000, tune=5000, chains=2, cores=8, return_inferencedata=True)

with model:
    az.plot_trace(tr, var_names=["mean", "y"], divergences=None, legend=True, compact=False)
When you inspect the output of the above cell, you will see that the samples we drew for mean follow the half normal distribution we specified, and y accordingly shows a positive skew.
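If you prefer a numeric check over the trace plot, something like the following sketch confirms both points (it assumes the trace tr from the cell above and that scipy is installed):

from scipy.stats import skew

mean_draws = tr.posterior["mean"].values.ravel()
y_draws = tr.posterior["y"].values.ravel()
print(mean_draws.min())  # non-negative, as expected from the half normal prior
print(skew(y_draws))     # positive, matching the skew visible in the trace plot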
Now let's create some data:
y_obs = np.random.normal(loc=2, scale=1.0, size=1000)
As you can see, the data we created follows a normal distribution with a mean of 2 (loc=2), quite different from our prior belief.
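A quick sanity check on the simulated data (just a sketch) confirms this:

print(y_obs.mean(), y_obs.std())  # roughly 2.0 and 1.0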
Now let's run our model again, feeding it our actual observations of the random variable y.
model2 = pm.Model()
with model2:
    mean = pm.HalfNormal("mean", sigma=1)
    y = pm.Normal("y", mu=mean, sigma=1, observed=y_obs)
    tr2 = pm.sample(1000, tune=5000, chains=2, cores=8, return_inferencedata=True)
Visualise the samples:
with model2:
    az.plot_trace(tr2, var_names=["mean"], divergences=None, legend=True, compact=False)
When you run this code you will see that the inferred distribution for mean has changed quite drastically: it is now tightly centered around 2 and looks approximately normal.
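You can confirm this numerically with a posterior summary, for example (a sketch, assuming the trace tr2 from above):

print(az.summary(tr2, var_names=["mean"]))  # posterior mean close to 2, small sd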
What happened is that, by providing data to the model via observed=y_obs, you specify a likelihood for the observed outcomes of y; by applying Bayes' theorem, the algorithm can then infer the posterior distribution for the true value of mean.
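To get an intuition for what the likelihood is doing here, you can compute an unnormalised log posterior by hand with scipy (a rough sketch, not PyMC3's internals; it assumes y_obs from above). Candidate values of mean near 2 explain the data far better than values near 0:

from scipy.stats import halfnorm, norm

def log_posterior_unnormalised(m):
    # log prior (half normal, sigma=1) plus log likelihood of all observations
    return halfnorm.logpdf(m, scale=1.0) + norm.logpdf(y_obs, loc=m, scale=1.0).sum()

print(log_posterior_unnormalised(2.0))  # much larger (less negative) ...
print(log_posterior_unnormalised(0.5))  # ... than this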
Upvotes: 0
Reputation: 5722
observed means that the value of the linear regression's response variable (typically named "y", but here confusingly named likelihood) is known, through observation, to be equal to df['y'].
When the inference algorithm is run, the values in df['y'] will be used to determine the likely values of the stochastic variables intercept and x_coeff that would have caused them. To do that, it uses the causal relationship between them, namely that the observed variable is normally distributed with mean equal to intercept + x_coeff*df['x'] and standard deviation sigma.
Note that df['y'] is typically an array with multiple observations, so the algorithm will try to infer the distributions of intercept and x_coeff likely to have induced these multiple observations.
Note that the algorithm will not infer values for df['x'], since that is also fixed, observed data.
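For context, a minimal, self-contained version of the model the question is asking about could look like the sketch below. The synthetic data, the true parameter values and the prior choices are assumptions made purely for illustration:

import numpy as np
import pandas as pd
import pymc3 as pm

# synthetic data (assumed for illustration): y = 1.0 + 2.5 * x + noise
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 1.0 + 2.5 * df["x"] + rng.normal(scale=0.5, size=200)

with pm.Model() as lin_model:
    # priors for the unknown parameters (assumed choices)
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    x_coeff = pm.Normal("x_coeff", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # the observed response: observed=df['y'] fixes its values to the data,
    # turning this distribution into the likelihood used during inference
    likelihood = pm.Normal("Y", mu=intercept + x_coeff * df["x"],
                           sd=sigma, observed=df["y"])

    trace = pm.sample(1000, tune=1000, return_inferencedata=True)

After sampling, the posteriors for intercept and x_coeff should concentrate around the values used to generate the data, while df['x'] and df['y'] themselves are never sampled.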
I mentioned that the variable was confusingly named likelihood instead of y. That is because pm.Normal creates a stochastic variable object, not a real-valued likelihood. I believe this name was chosen out of tradition, because the observed values define a likelihood that is used internally by the inference algorithm to infer the distributions of the other stochastic variables.
In fact, in the PyMC introduction we see a similar definition using the name Y_obs instead:
Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)
Upvotes: 1