The Pointer
The Pointer

Reputation: 2386

Problems with if() condition in Stan/RStan when modelling values from binomial random variable

I'm trying to use Stan and R to fit a model that, uhh, models the observed realisations y_i = 16, 9, 10, 13, 19, 20, 18, 17, 35, 55, which are from a binomial distributed random variable, say, Y_i, with parameters m_i (number of trials) and p_i (success probability in each trial).

yi = c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55)

For the purposes of this experiment, I'm going to assume that all of the m_i are fixed and given by m_i = 74, 99, 58, 70, 122, 77, 104, 129, 308, 119.

mi = c(74, 99, 58, 70, 122, 77, 104, 129, 308, 119)

I'm going to use Jeffrey's prior: \alpha=0.5 and \beta=0.5.

alpha = 0.5, beta = 0.5

I'm trying to

  1. Find Bayesian estimates for the p_i.
  2. Find the range of the p_i (i.e., the parameters k as follows: enter image description here

My attempt at 2. is this section of code:

        real k;
        real mx = 0;
        real mn = 0;
        if (p > mx) 
          mx = p;
        if (mn > p) {
          mn = p;
        }
        k = mx - mn;

My Stan code is as follows:

```{stan output.var="BinModBeta"}
  data {
    int <lower = 1> mi[10];
    int <lower = 0> yi[10];
    real <lower = 0> alpha;
    real <lower = 0> beta;
  }

  parameters {
    real <lower = 0, upper = 1> p[10];
  }

  transformed parameters {
    real k;
    real mx = 0;
    real mn = 0;
    if (p > mx) 
      mx = p;
    if (mn > p) {
      mn = p;
    }
    k = mx - mn;
  }

  model {
    yi ~ binomial(mi, p);
    p ~ beta(alpha, beta);
  }
```

My R code is as follows:

```{r}
library(rstan)
```

```{r}
data.in <- list(mi = c(74, 99, 58, 70, 122, 77, 104, 129, 308, 119), yi = c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55), alpha = 0.5, beta = 0.5)
model.fit1 <- sampling(BinModBeta, data=data.in)
```

```{r}
print(model.fit1, pars = c("p"), probs=c(0.1,0.5,0.9), digits = 5)
```

Now, I just started learning Stan, so I'm honestly not sure if this is correct at all. However, it seems like this code works for my first aim (at least, whatever I've coded seems to work ...). But my trouble starts when attempting to code my second aim.

When I attempt to compile the Stan code above, I get the following error:

enter image description here

Now, based on this error message, it seems like my issue is arising from the fact that p is a vector of 10 real values, rather than a single real. However, due to my inexperience with Stan, I'm unsure of how to get around this problem.

Upvotes: 3

Views: 2737

Answers (1)

Maurits Evers
Maurits Evers

Reputation: 50718

Here is what I would do:

model <- "
data {
  int <lower = 1> mi[10];
  int <lower = 0> yi[10];
  real <lower = 0> alpha;
  real <lower = 0> beta;
}

parameters {
  real <lower = 0, upper = 1> p[10];
}

model {
    p ~ beta(alpha, beta);        // Prior
    yi ~ binomial(mi, p);         // Likelihood
}

generated quantities {
    real k;
    k = max(p) - min(p);
}
"

library(rstan);
yi = c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55);
mi = c(74, 99, 58, 70, 122, 77, 104, 129, 308, 119);
fit <- stan(
    model_code = model,
    data = list(mi = mi, yi = yi, alpha = 0.5, beta = 0.5))
fit;
#Inference for Stan model: 6a01a3b25656e1b18183baf19183abf7.
#4 chains, each with iter=2000; warmup=1000; thin=1;
#post-warmup draws per chain=1000, total post-warmup draws=4000.
#
#         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
#p[1]     0.22    0.00 0.05    0.13    0.19    0.22    0.25    0.32  4000    1
#p[2]     0.10    0.00 0.03    0.05    0.07    0.09    0.11    0.16  4000    1
#p[3]     0.18    0.00 0.05    0.09    0.14    0.17    0.21    0.28  4000    1
#p[4]     0.19    0.00 0.05    0.11    0.16    0.19    0.22    0.29  4000    1
#p[5]     0.16    0.00 0.03    0.10    0.14    0.16    0.18    0.22  4000    1
#p[6]     0.26    0.00 0.05    0.17    0.23    0.26    0.30    0.37  4000    1
#p[7]     0.18    0.00 0.04    0.11    0.15    0.17    0.20    0.25  4000    1
#p[8]     0.13    0.00 0.03    0.08    0.11    0.13    0.15    0.20  4000    1
#p[9]     0.11    0.00 0.02    0.08    0.10    0.11    0.13    0.15  4000    1
#p[10]    0.46    0.00 0.04    0.38    0.43    0.46    0.49    0.55  4000    1
#k        0.38    0.00 0.05    0.28    0.34    0.38    0.41    0.47  4000    1
#lp__  -530.01    0.05 2.26 -535.38 -531.33 -529.65 -528.37 -526.69  1782    1
#
#Samples were drawn using NUTS(diag_e) at Tue Apr 24 22:02:07 2018.
#For each parameter, n_eff is a crude measure of effective sample size,
#and Rhat is the potential scale reduction factor on split chains (at
#convergence, Rhat=1).

Comments:

  1. I would move the part involving the calculation of k into the generated quantities block; this has to do with the different program blocks being executed at different times. While the transformed parameters block is executed once every leapfrog step, the generated quantities block is only executed once per sample draw. So there will be less overhead in recalculating k. See e.g. here for details. Note that uncertainties from the pi posterior densities get properly propagated to k.

  2. You can make use of Stans internal max,min functions when calculating k. This will be faster than determining the min/max of pi with if conditions, and also eliminates the need to define mn and mx.

Upvotes: 2

Related Questions