STAN code: indexing and variable declaration

I’m trying to move from JAGS to STAN but I don’t understand variable declaration and indexes for looping.

I have some data that I want to model in STAN with a beta distribution. The rationale of the task is as follows: N participants produce T estimations. These estimations are drawn from a beta distribution with parameters a and b that are fixed for each individual participant. Thus, my data look like this:

   ID Trial  Response
1   1     1 0.9771303
2   1     2 0.7455432
3   1     3 0.7949766
4   1     4 0.0490466
5   1     5 0.1096347
6   2     1 0.6766598
7   2     2 0.4160906
8   2     3 0.5342134
9   2     4 0.5437952
10  2     5 0.5560163
11  3     1 0.4378267
12  3     2 0.3561471
13  3     3 0.4455051
14  3     4 0.5324450
15  3     5 0.7438497

You can simulate this:

    a <- c(1, 6, 3)
    b <- c(1, 4, 5)
    data.frame(ID = rep(c(1, 2, 3), each = 5), Trial = rep(1:5), Response = c(rbeta(5, a[1], b[1]), rbeta(5, a[2], b[2]), rbeta(5, a[3], b[3])))

Now, my STAN code is supposed to take the responses from each trial and participant to estimate the underlying parameters of the beta distribution, a and b, which vary between individuals. In JAGS, I would do a nested loop, something like:`

    for(i in 1:N) {
        for(j in 1:trials) {
            y[i,j] ~ dbeta(a[i], b[i])
    }
    }

But, how would you specify a model like this in STAN? My code isn’t working but I don’t know if the problem is with the parameter definition, the data structure… It should be easy.

Upvotes: 1

Views: 57

Answers (1)

Anders Gorm
Anders Gorm

Reputation: 35

One approach would be to use "nested indexing": You then need only two vectors: one vector of outcomes ("Response", where each value is in the range [0,1]) and one vector of ID values (where each value is either 1, 2, or 3, corresponding to the 3 persons in the experiment). Both of these vectors should be of length N (the total number of data points). So something looking like this:

> ID
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
> Response
 [1] 0.9995234 0.7859138 0.9598987 0.4499918 0.8160071 0.5721492 0.7056597 0.6387936
 [9] 0.5273212 0.6832860 0.4897549 0.4070146 0.3150005 0.1806185 0.3810229

In the model you can then have a non-nested for loop over all N data points, and for each iteration refer to the model parameters for the relevant beta-distribution using nested indexing, along these lines:

for(i in 1:N) {
    Response[i] ~ beta(a[ID[i]], b[ID[i]]);
}

The trick here is to use a[ID[i]] (nested indexing) which will refer to either a[1], a[2], or a[3], and where the ID vector keeps track of which person the current data point corresponds to.

Actually, the beta distribution is vectorized so you don't need to loop in the first place:

Response ~ beta(a[ID], b[ID]);

Below is an example with all necessary R-code (and with explicit looping). Note that it would probably be a good idea to explicitly add a prior for the a and b vectors (omitting the prior means that you get the default which is an improper, uniform prior over the allowed interval - here all values > 0).

library(tidyverse)
library(rstan)

a = c(1, 6, 3)
b = c(1, 4, 5)
ntrials = 100
x = tibble(
  ID = rep(c(1, 2, 3), each = ntrials), 
  Response = c(rbeta(ntrials, a[1], b[1]), 
               rbeta(ntrials, a[2], b[2]), 
               rbeta(ntrials, a[3], b[3]))
)

stan_dat = list(
  N = nrow(x),
  Npersons = n_distinct(x$ID),
  ID = x$ID,
  Response = x$Response
)

stan_mod = " 
data {
  int<lower=0> N;
  int<lower=0> Npersons;
  array[N] int ID;
  vector[N] Response;
}
parameters {
  vector<lower=0>[Npersons] a;
  vector<lower=0>[Npersons] b;
}
model {
  // No prior, so implicitly flat (uniform) and improper
  // Likelihood
  for(i in 1:N) {
    Response[i] ~ beta(a[ID[i]], b[ID[i]]);
  }
}
"

fit = stan(model_code=stan_mod, data=stan_dat, chains=4, iter=2000)
print(fit)

Upvotes: 0

Related Questions