Reputation: 11
I am brand new to Python and PyMC3, so forgive me if this question is rather silly. I have a dataset called toy_data.csv:
batch_no | batch_id | points | pred1 | pred2 |
---|---|---|---|---|
12-150 | 1 | 1 | 70.26 | 10.766 |
12-150 | 1 | 2 | 65.72 | 10.512 |
12-150 | 1 | 6 | 55.44 | 7.78 |
12-150 | 1 | 4 | 63.554 | 12.552 |
12-150 | 1 | 5 | 61.66 | 10.95 |
12-150 | 1 | 3 | 65.44 | 10.808 |
12-190 | 2 | 3 | 55.68 | 6.36 |
12-190 | 2 | 4 | 67.36 | 7.38 |
12-190 | 2 | 1 | 78 | 9.32 |
12-190 | 2 | 2 | 70.12 | 8.554 |
12-190 | 3 | 5 | 60.68 | 9.568 |
12-190 | 3 | 3 | 68.694 | 9.634 |
12-190 | 3 | 2 | 56.118 | 11.474 |
12-190 | 3 | 4 | 58.54 | 9.31 |
12-190 | 3 | 6 | 65.08 | 10.604 |
13-999 | 4 | 6 | 56.73 | 9.754 |
13-999 | 4 | 3 | 55.492 | 9.4 |
13-999 | 4 | 2 | 51.556 | 9.802 |
13-999 | 4 | 1 | 53.748 | 8.85 |
13-999 | 4 | 5 | 59.054 | 9.204 |
13-999 | 4 | 4 | 49.77 | 9.468 |
13-999 | 4 | 7 | 58.266 | 9.954 |
13-999 | 4 | 8 | 57.78 | 9.38 |
14-140 | 5 | 2 | 69.68 | 12.086 |
14-140 | 5 | 1 | 68.5 | 10.438 |
14-140 | 5 | 6 | 71.7 | 11.624 |
14-140 | 5 | 3 | 63.68 | 11.058 |
14-140 | 5 | 4 | 62.02 | 10.498 |
14-140 | 5 | 5 | 61.94 | 10.95 |
14-140 | 5 | 9 | 57.22 | 10.164 |
14-140 | 5 | 7 | 54.44 | 9.798 |
14-140 | 5 | 8 | 60.82 | 10.784 |
14-290 | 6 | 1 | 56.2 | 9.566 |
14-290 | 6 | 1 | 50.06 | 9.23 |
14-290 | 6 | 2 | 50.76 | 10.646 |
14-290 | 6 | 2 | 46.98 | 8.38 |
14-700 | 7 | 2 | 92.8 | 11.532 |
14-700 | 7 | 1 | 81 | 9.522 |
14-700 | 7 | 3 | 75.62 | 10.004 |
14-700 | 8 | 2 | 71.44 | 10.076 |
14-889 | 8 | 1 | 55.2 | 8.1 |
14-889 | 8 | 3 | 71.18 | 9.412 |
14-889 | 8 | 4 | 53.24 | 7.138 |
14-900 | 9 | 3 | 50.08 | 7.744 |
14-900 | 9 | 1 | 47.138 | 8.294 |
14-900 | 9 | 4 | 68.56 | 11.362 |
14-900 | 9 | 1 | 69.34 | 11.584 |
14-900 | 9 | 2 | 63.14 | 10.77 |
I would like to grab the relevant batch & predictor from the hierarchical model trace so I can plot them and draw the best-fit (abline) line. The final output should replicate the graph below, but instead of the gray range I'd like to show the abline for each batch.
For example:
Abline: y_best_fit_i = (mu_a + a_i) + (mu_b + b_i) * pred_i
with all parameters backscaled using the mean and std of the respective x variable (pred1 or pred2). Considering only pred1, the backscaling formula would be mu_a = (mu_a * pred1.std()) + pred1.mean(), and the same would be applied to mu_b and to all a_i and b_i values. However, backscaling with the pred1 mean and std should only apply to the trace entries belonging to pred1, not pred2 (i.e. it applies to mu_a, mu_b, a[0,0,0] and b[0,0,0], but not to a[0,1,0] or b[0,1,0]). I would like to apply it only to mu_a, mu_b, a[i,0,0] and b[i,0,0], where i is the batch index (see table below), and to automate this if possible. The table below is the summary of the trace.
It would be good if I could find a way to get the numbers from the trace itself to plot the final graph. Any help is greatly appreciated.
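To make this concrete, here is roughly what I have in mind for pred1 and batch 0 only, done by hand (I am not sure the indexing or the back-transformation is correct, and hier_trace refers to the trace from my model code below):
# my manual attempt for pred1 / batch 0 only -- probably not the right way
a_samples = hier_trace["a"]          # shape (n_samples, n_batch, n_pred, 1)
b_samples = hier_trace["b"]
mu_a_samples = hier_trace["mu_a"]
mu_b_samples = hier_trace["mu_b"]
# backscale with pred1's mean and std, as per the formula above
intercept = ((mu_a_samples + a_samples[:, 0, 0, 0]) * pred1.std()) + pred1.mean()
slope = ((mu_b_samples + b_samples[:, 0, 0, 0]) * pred1.std()) + pred1.mean()
# abline for batch 0 using the posterior means
pred1_batch0 = pred1[batch_id == 0]
y_best_fit = intercept.mean() + slope.mean() * pred1_batch0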
The code I used to build the model is below:
import pandas as pd
import numpy as np
import pymc3 as pm
from sklearn import preprocessing
seed = 68492
np.random.seed(seed)
data = pd.read_csv("toy_data.csv")
# assigning variables
batch_no = data.iloc[:, 0]
batch_id = (data.iloc[:, 1])-1
point = data.iloc[:, 2]
pred = data.iloc[:, 3:5]
pred1 = data.iloc[:, 3]
pred2 = data.iloc[:, 4]
# creating indicators
n_pred = pred.shape[1]
n_batch = len(np.unique(batch_id))
# standarising variable
std_pred = preprocessing.scale(pred)
# Hierarchical model
with pm.Model() as hierarchical_model1:
    # Hyperpriors
    mu_a = pm.Normal('mu_a', mu=0, sigma=100)
    sigma_a = pm.HalfNormal('sigma_a', sigma=100)
    mu_b = pm.Normal('mu_b', mu=0, sigma=100)
    sigma_b = pm.HalfNormal('sigma_b', sigma=100)
    # Intercept for each batch, distributed around group mean mu_a
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=(n_batch, n_pred, 1))
    # Slope for each batch, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=(n_batch, n_pred, 1))
    # Model error
    eps = pm.HalfCauchy('eps', beta=5)
    # Expected value
    omega = (pm.math.dot(std_pred, b))[batch_id, 1]
    tpoint_est = a[batch_id, 1] + omega
    # Data likelihood
    y_like = pm.Normal('y_like', mu=tpoint_est, sigma=eps, observed=point)
with hierarchical_model1:
    hier_trace = pm.sample(draws=4000, tune=1000, chains=2)
    hier_out = pm.summary(hier_trace)
Upvotes: 1
Views: 220
Reputation: 8823
Here is what I have; it should solve the batch/pred looping issues and serve as a skeleton for what you actually want to do. I was not able to follow the scaling and rescaling. I also have not been able to understand the 3rd dimension of length one, but I have kept it there.
Note that I am using ArviZ (a dependency of PyMC3), which is built on top of xarray, to get label-based indexing and automatic broadcasting. I am actually using the latest features added to ArviZ (development version) and PyMC3 3.11.1. You can install the ArviZ development version with pip install git+git://github.com/arviz-devs/arviz.git.
I started with a couple of changes in the model in order to define the shapes of the variables with named dimensions:
coords = {
    "batch": np.arange(n_batch),
    "pred_num": ["pred1", "pred2"],
    "one": [1]
}

with pm.Model(coords=coords) as hierarchical_model1:
    ...
    # Intercept for each batch, distributed around group mean mu_a
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, dims=("batch", "pred_num", "one"))
    # Slope for each batch, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, dims=("batch", "pred_num", "one"))
    ...
The rest of the code is exactly the same, except for sampling, where I have added the kwarg return_inferencedata=True.
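For reference, the sampling call would then look something like this (the posterior shown further down has 4 chains, so the default number of chains was presumably used rather than chains=2 from the question):
with hierarchical_model1:
    # return_inferencedata=True wraps the result as an arviz.InferenceData backed by xarray
    hier_trace = pm.sample(draws=4000, tune=1000, return_inferencedata=True)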
Once sampled, I created an xarray dataset from the pred array in order to take advantage of the automatic broadcasting provided by xarray that I mentioned above. As I have set return_inferencedata=True, I already have all the variables in the model as xarray objects. To create an xarray variable we have to provide the values, the dimension names and the coordinate variables:
import xarray as xr

pred_ds = xr.DataArray(
    pred, dims=("x", "pred_num"), coords={"pred_num": coords["pred_num"]}
).to_dataset(name="pred")
post = hier_trace.posterior
y_ds = (post["mu_a"]+post["a"])+(post["mu_b"]+post["b"])*pred_ds
y_ds # in jupyter you'll see a nice html interactive summary
# here is the plain text output
# <xarray.Dataset>
# Dimensions: (batch: 9, chain: 4, draw: 1000, one: 1, pred_num: 2, x: 48)
# Coordinates:
# * pred_num (pred_num) <U5 'pred1' 'pred2'
# * chain (chain) int64 0 1 2 3
# * draw (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
# * batch (batch) int64 0 1 2 3 4 5 6 7 8
# * one (one) int64 1
# Dimensions without coordinates: x
# Data variables:
# pred (chain, draw, batch, pred_num, one, x) float64 1.938e+03 ... 67.98
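Because everything is labeled, a single batch/pred combination can also be pulled out by label; for example (purely illustrative, the variable name is mine):
# label-based selection: posterior lines for batch 0 and pred1 only
y_batch0_pred1 = y_ds["pred"].sel(batch=0, pred_num="pred1")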
We can then use ArviZ to loop over the desired variables and generate one subplot for each combination of batch and pred. It looks quite strange though because there is no rescaling at all.
import arviz as az
import matplotlib.pyplot as plt

fig, axes = plt.subplots(n_pred, n_batch, figsize=(11, 5), sharex=True, sharey="row")
x = np.arange(len(batch_id))
The first step is to create the subplots we will use to plot, and a dummy variable to represent the x values.
from arviz.labels import BaseLabeller

labeller = BaseLabeller()
iterator = az.sel_utils.xarray_var_iter(
    y_ds.squeeze().transpose("chain", "draw", "x", "pred_num", "batch"),
    combined=True,
    skip_dims={"x"},
)
We then create this iterator object; the labeller is purely aesthetic (and has detailed docs here). Note that y_ds is transposed first. This is also aesthetic: if the batch dimension came before pred_num, then when plotting we would not see all the pred1 panels on the same row. combined=True and skip_dims indicate that we don't want to iterate over chains nor over values in the x dimension (by default ArviZ never iterates over draws). Now we can loop over this iterator to get arrays with chain, draw, x dimensions, so we can plot the HDI, the mean, etc. and get a plot that looks somewhat like the example in the question:
for ax, (_, sel, isel, values) in zip(axes.ravel(), iterator):
    # shaded HDI band for this batch/pred combination
    az.plot_hdi(x, values, ax=ax)
    # posterior mean over chains and draws
    ax.plot(x, values.mean(axis=(0, 1)))
    ax.set_title(labeller.make_label_vert("y", sel, isel))
fig.tight_layout()
Note that the y-axis is only shared within each row; the scales of pred1 and pred2 differ by an order of magnitude even though they look similar with this plot layout.
Side note: I have encountered a lot of divergences, and most r_hat values are above the recommended 1.01. I am quite sure that NUTS has not converged; you probably should not trust these results.
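If it is useful, the usual ArviZ diagnostics can confirm this; a minimal sketch (not something shown above, just the standard calls):
# illustrative convergence check: inspect r_hat/ess and count divergences
print(az.summary(hier_trace, var_names=["mu_a", "mu_b", "sigma_a", "sigma_b"]))
print("divergences:", hier_trace.sample_stats["diverging"].sum().item())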
Upvotes: 1