clueless

Reputation: 11

Plot multiple scatterplots from a hierarchical model with 2 predictors

I am brand new to Python and PyMC3, so forgive me if this question is rather silly. I have a dataset called toy_data.csv:

batch_no batch_id points pred1 pred2
12-150 1 1 70.26 10.766
12-150 1 2 65.72 10.512
12-150 1 6 55.44 7.78
12-150 1 4 63.554 12.552
12-150 1 5 61.66 10.95
12-150 1 3 65.44 10.808
12-190 2 3 55.68 6.36
12-190 2 4 67.36 7.38
12-190 2 1 78 9.32
12-190 2 2 70.12 8.554
12-190 3 5 60.68 9.568
12-190 3 3 68.694 9.634
12-190 3 2 56.118 11.474
12-190 3 4 58.54 9.31
12-190 3 6 65.08 10.604
13-999 4 6 56.73 9.754
13-999 4 3 55.492 9.4
13-999 4 2 51.556 9.802
13-999 4 1 53.748 8.85
13-999 4 5 59.054 9.204
13-999 4 4 49.77 9.468
13-999 4 7 58.266 9.954
13-999 4 8 57.78 9.38
14-140 5 2 69.68 12.086
14-140 5 1 68.5 10.438
14-140 5 6 71.7 11.624
14-140 5 3 63.68 11.058
14-140 5 4 62.02 10.498
14-140 5 5 61.94 10.95
14-140 5 9 57.22 10.164
14-140 5 7 54.44 9.798
14-140 5 8 60.82 10.784
14-290 6 1 56.2 9.566
14-290 6 1 50.06 9.23
14-290 6 2 50.76 10.646
14-290 6 2 46.98 8.38
14-700 7 2 92.8 11.532
14-700 7 1 81 9.522
14-700 7 3 75.62 10.004
14-700 8 2 71.44 10.076
14-889 8 1 55.2 8.1
14-889 8 3 71.18 9.412
14-889 8 4 53.24 7.138
14-900 9 3 50.08 7.744
14-900 9 1 47.138 8.294
14-900 9 4 68.56 11.362
14-900 9 1 69.34 11.584
14-900 9 2 63.14 10.77

I would like to grab the relevant batch and predictor from the hierarchical model's trace, plot them, and overlay the best-fit line (abline) for each. The final output should replicate the graph below, but instead of the gray band I'd like to show the abline for each batch.

Ideal output (sourced from GitHub)

For example:

Abline: y_best_fit_i = (mu_a + a_i) + (mu_b + b_i) * pred_i, with all parameters back-scaled using the mean and standard deviation of the respective x variable (pred1 or pred2). Considering only pred1, the back-scaling formula would be mu_a = (mu_a * pred1.std()) + pred1.mean(); the same is applied to mu_b and to all a_i and b_i values. However, back-scaling with pred1's mean and standard deviation should only apply to the part of the trace belonging to pred1, not pred2 (i.e. it applies to mu_a, mu_b, a[0,0,0] and b[0,0,0], but not to a[0,1,0] or b[0,1,0]). I would like to apply it only to mu_a, mu_b, a[i,0,0] and b[i,0,0], where i is the batch index (see table below), and to automate it if possible. The table below is the summary of the trace.

Summary Table

It would be good if I could find a way to get the numbers from the trace itself to plot the final graph. Any help is greatly appreciated.
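Something like the untested sketch below is what I have in mind for the back-scaling, following the formula above exactly (backscale and the *_scaled names are just made up), but I don't know how to automate it across batches and predictors:

# Untested sketch: back-scale per the formula above.
# hier_trace is the trace from the model below;
# hier_trace['a'] has shape (draws, n_batch, n_pred, 1).
def backscale(values, predictor):
    return values * predictor.std() + predictor.mean()

mu_a_scaled = backscale(hier_trace['mu_a'], pred1)
mu_b_scaled = backscale(hier_trace['mu_b'], pred1)
a_pred1 = backscale(hier_trace['a'][:, :, 0, 0], pred1)  # one column per batch
b_pred1 = backscale(hier_trace['b'][:, :, 0, 0], pred1)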

The code I used to build the model is below:

import pandas as pd
import numpy as np
import pymc3 as pm
from sklearn import preprocessing

seed = 68492
np.random.seed(seed)

data = pd.read_csv("toy_data.csv")    

# assigning variables
batch_no = data.iloc[:, 0]
batch_id = (data.iloc[:, 1])-1
point = data.iloc[:, 2]
pred = data.iloc[:, 3:5]
pred1 = data.iloc[:, 3]
pred2 = data.iloc[:, 4]


# creating indicators
n_pred = pred.shape[1]
n_batch = len(np.unique(batch_id))

# standardising the predictors
std_pred = preprocessing.scale(pred)


# Hierarchical model 
with pm.Model() as hierarchical_model1:

    # Hyperpriors
    mu_a = pm.Normal('mu_a', mu=0, sigma=100)
    sigma_a = pm.HalfNormal('sigma_a', sigma=100)
    mu_b = pm.Normal('mu_b', mu=0, sigma=100)
    sigma_b = pm.HalfNormal('sigma_b', sigma=100)   
    

    # Intercept for each batch, distributed around group mean mu_a
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=(n_batch, n_pred, 1))
    # Slope for each batch, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=(n_batch, n_pred, 1))
    
    # Model error
    eps = pm.HalfCauchy('eps', beta=5)
    
        
    # Expected value
    omega = (pm.math.dot(std_pred, b))[batch_id,1]
    tpoint_est = a[batch_id,1] + omega

         
    # Data likelihood
    y_like = pm.Normal('y_like', mu=tpoint_est, sigma=eps, observed=point)
    

with hierarchical_model1:
    hier_trace = pm.sample(draws=4000, tune=1000, chains=2)
    hier_out = pm.summary(hier_trace)

Upvotes: 1

Views: 220

Answers (1)

OriolAbril

Reputation: 8823

Here is what I have; it should solve the batch/pred looping issues and serve as a skeleton for what you actually want to do. I was not able to follow the scaling and rescaling. I also have not been able to understand the 3rd dimension of length one, but I have kept it there.

Note that I am using ArviZ (a dependency of PyMC3), which is built on top of xarray, to get label-based indexing and automatic broadcasting. I am actually using the latest features added to ArviZ (the development version) together with PyMC3 3.11.1. You can install the ArviZ development version with pip install git+git://github.com/arviz-devs/arviz.git.

I started with a couple changes in the model in order to define the shapes of the variables with named dimensions:

coords = {
    "batch": np.arange(n_batch),
    "pred_num": ["pred1", "pred2"],
    "one": [1]
}
with pm.Model(coords=coords) as hierarchical_model1:
    ...

    # Intercept for each batch, distributed around group mean mu_a
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, dims=("batch", "pred_num", "one"))
    # Slope for each batch, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, dims=("batch", "pred_num", "one"))
    
    ...

The rest of the code is exactly the same, except for the sampling call, where I have added the kwarg return_inferencedata=True.
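For reference, the sampling block then becomes:

with hierarchical_model1:
    # return_inferencedata=True makes pm.sample return an ArviZ InferenceData
    # object (xarray-based) instead of a MultiTrace
    hier_trace = pm.sample(draws=4000, tune=1000, chains=2, return_inferencedata=True)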

Once sampled, I created an xarray dataset from the pred array in order to take advantage of the automatic broadcasting provided by xarray that I mentioned above. Since I set return_inferencedata=True, I already have all the model variables as xarray objects. To create an xarray variable we have to provide the values, the dimension names and the coordinate variables:

import xarray as xr
pred_ds = xr.DataArray(
    pred, dims=("x", "pred_num"), coords={"pred_num": coords["pred_num"]}
).to_dataset(name="pred")
post = hier_trace.posterior
y_ds = (post["mu_a"]+post["a"])+(post["mu_b"]+post["b"])*pred_ds
y_ds # in jupyter you'll see a nice html interactive summary
# here is the plain text output
# <xarray.Dataset>
# Dimensions:   (batch: 9, chain: 4, draw: 1000, one: 1, pred_num: 2, x: 48)
# Coordinates:
#   * pred_num  (pred_num) <U5 'pred1' 'pred2'
#   * chain     (chain) int64 0 1 2 3
#   * draw      (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
#   * batch     (batch) int64 0 1 2 3 4 5 6 7 8
#   * one       (one) int64 1
# Dimensions without coordinates: x
# Data variables:
#     pred      (chain, draw, batch, pred_num, one, x) float64 1.938e+03 ... 67.98

We can then use ArviZ to loop over the desired variables and generate one subplot for each combination of batch and pred. It looks quite strange, though, because there is no rescaling at all.

import matplotlib.pyplot as plt
fig, axes = plt.subplots(n_pred, n_batch, figsize=(11,5), sharex=True, sharey="row")
x = np.arange(len(batch_id))

The first step is to create the subplots we will use, and a dummy variable to represent the x axis.

import arviz as az
from arviz.labels import BaseLabeller
labeller = BaseLabeller()
iterator = az.sel_utils.xarray_var_iter(
    y_ds.squeeze().transpose("chain", "draw", "x", "pred_num", "batch"), 
    combined=True, 
    skip_dims={"x",}
)

We then create this iterator object; the labeller is purely aesthetic (and has detailed docs here). The first step is transposing y_ds. This is also aesthetic: if the batch dimension came first, then when plotting we would not see all of pred1 on the same row. combined=True and skip_dims indicate that we don't want to iterate over chains nor over values in the x dimension (by default ArviZ never iterates over draws). Now we can loop using this iterator to get arrays with chain, draw and x dimensions, plot the HDI and the mean, and get a plot that roughly looks like the example in the question:

for ax, (_, sel, isel, values) in zip(axes.ravel(), iterator):
    az.plot_hdi(x, values, ax=ax)
    ax.plot(x, values.mean(axis=(0,1)))
    ax.set_title(labeller.make_label_vert("y", sel, isel))
fig.tight_layout()

[plot: grid of HDI/mean subplots, one per batch and predictor]

Note that the y axis is only shared within each row; the scales of pred1 and pred2 differ by an order of magnitude even though they look similar with this plot layout.

Side note: I have encountered a lot of divergences, and most rhat values are above the recommended 1.01. I am quite sure that NUTS has not converged; you probably should not trust these results.
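If you want to check this yourself, something along these lines should work (a quick sketch, assuming hier_trace is the InferenceData object from above):

# Count divergent transitions and look at the worst r_hat value
print("divergences:", int(hier_trace.sample_stats["diverging"].sum()))
print("worst r_hat:", az.summary(hier_trace)["r_hat"].max())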

Upvotes: 1
