Marginalize over missing discrete response data in Stan

Question

I have some ordinal data with missingness, which I am trying to model in Stan. Since Stan cannot handle discrete parameters directly, I am attempting to marginalize over the different possible values of the response variable for those cases which are missing.

Intuitively, I believe that I need the probability of the missing values being certain ordinal outcomes, multiplied by the probability of those ordinal outcomes, and then the probability of the missing values not being certain ordinal outcomes, multiplied by 1 - probability of those ordinal outcomes.

However, in practice I am uncertain if I am coding the model correctly in Stan itself or whether my exact mathematical intuition is correct for marginalising over the missing data. Although the Stan manual has examples of dealing with discrete parameters, I'm still a bit lost. I would also like to make predictions or attain an imputed vector of y values in Stan generated quantities block, but I could do with some pointers of getting started. The problem is not getting code running, more sanity checking that I have done it correctly.

Example

As an example, consider the wine data from the R package ordinal. The R code below takes the rating variable, transforms it to a 3 category scale for ease of explication, induces 15% missingness in the variable new_rating_w_miss, and then creates a vector missing that takes a 1 if the value is missing and a 0 otherwise. Since Stan won't take NA values, I have given the missing values a value of -1.

library(ordinal)
library(rstan)
library(runjags)
library(rjags)
data(wine)

wine$new_rating <- ifelse( wine$rating > 3, 3, wine$rating)
wine$new_rating_w_miss <- ifelse( wine$rating > 3, 3, wine$rating)
wine$temp_num <- ifelse(wine$temp == "cold", 1, 0)

set.seed(2017)
n_missing <- sample(0:nrow(wine), nrow(wine)*0.15, replace=FALSE)

wine$new_rating_w_miss[ n_missing ] <- -1
wine$is_missing <- ifelse(wine$new_rating_w_miss == -1, 1, 0 )

Stan model

My Stan model code is as follows:

data {
  int K;
  int N;
  int D;
  int y[N];
  int missing[N];
  int N_miss;
  row_vector[D] x[N];
}

parameters {
  vector[D] beta;
  real p;
  ordered[K-1] c;
}

model {
  vector[K] theta;
  vector[N_miss] lp;
  beta ~ normal(0, 5);
  p ~ beta(2,2);
  for (n in 1:N) {
    real eta;
    eta = x[n] * beta;
    theta[1] = Phi( (c[1] - eta) );
    for (k in 2:(K-1))
      theta[k] = Phi( (c[k] - eta)) - Phi( (c[k-1] - eta));
    theta[K] = 1 - Phi( (c[K-1] - eta) );

    if(missing[n] == 0){
      y[n] ~ categorical(theta);
      missing[n] ~ bernoulli(p);
  }

if(missing[n] == 1){
target += log_mix( theta[1] ,
                    categorical_lpmf( 1 | theta ),
                    categorical_lpmf( (2 || 3) | theta )
                  )

            +

            log_mix( theta[2] ,
                                categorical_lpmf( 2 | theta ),
                                categorical_lpmf( (1 || 3) | theta )
                              )

            +

            log_mix( theta[3] ,
                                categorical_lpmf( 3 | theta ),
                                categorical_lpmf( (1 || 2) | theta )
                              );
  missing[n] ~ bernoulli(p);
    }
 }
}

As can be seen, I have attempted to marginalise over the discrete unknown values by using the log_mix() function in Stan. Specifically, adding up the (log) probabilities of the missing values being either 1, 2, or 3, given the (log) probability of those values estimated from the observed data, and then the counter-(log) probability.

One question is: should I be estimating separate theta vectors of probabilities from the observed and unobserved data?

The results:

Inference for Stan model: Stan_missing_model.
4 chains, each with iter=500; warmup=250; thin=1; 
post-warmup draws per chain=250, total post-warmup draws=1000.

           mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta[1]   -1.28    0.01 0.30   -1.83   -1.48   -1.28   -1.09   -0.67   808 1.00
p          0.16    0.00 0.04    0.09    0.13    0.15    0.18    0.24   812 1.00
c[1]      -1.26    0.01 0.23   -1.70   -1.43   -1.26   -1.11   -0.81   699 1.00
c[2]      -0.72    0.01 0.21   -1.15   -0.86   -0.70   -0.57   -0.27   746 1.00
lp__    -119.80    0.08 1.47 -123.45 -120.57 -119.34 -118.71 -118.02   378 1.02

Running the model in R

Running the model in R with the wine data above:

stan_data <- list( K = 3, N = nrow(wine), D = 1, 
                   y = wine$new_rating_w_miss, 
                   missing = wine$is_missing, 
                   N_miss = sum(wine$is_missing),
                   x = as.matrix(wine$temp_num) 
                  )

fit_stan <- stan(file = "Stan_missing_model.stan", data = stan_data, 
                  cores = 4, chains = 4, warmup = 250, iter = 500)
print(fit_stan)

Estimating a similar model in JAGS

A similar model in JAGS, which automatically imputes NA values, seem to arrive at not-too-dissimilar parameter estimates:

model{
   for( i in 1:N ){
       y[i] ~ dcat( theta[i, 1:3] )
       theta[i,1] <- phi( c[1] - mu[i] )
       theta[i,2] <- phi( c[2] - mu[i] ) - phi( c[1] - mu[i] )
       theta[i,3] <- 1 - phi( c[2] - mu[i] )

       mu[i] <- beta * temp[i]
       miss[i] ~ dbern( p )
    }

  beta ~ dnorm(0 , 5)
  p ~ dbeta( 2, 2)
  for(i in 1:2){
    c_raw[i] ~ dnorm(0,1.0E-3)
   }
  c <- sort(c_raw)

}

jags_data <- list( N = nrow(wine), 
                   y = ifelse(wine$new_rating_w_miss == -1, 
                              NA, wine$new_rating_w_miss), 
                   temp = wine$temp_num, 
                   miss = wine$is_missing
                  )

jags_monitor <- c("beta","p","c")

fit_jags <- run.jags(model = "JAGS_so_model.txt", 
                    data = jags_data, monitor = jags_monitor,
                    inits = list("c_raw" = c(-0.5,0.5)),
                    n.chains = 4, burnin = 500, sample = 500)
summary(fit_jags)

         Lower95     Median    Upper95       Mean        SD Mode       MCerr MC%ofSD SSeff       AC.10      psrf
beta -1.40131934 -0.8531391 -0.3148070 -0.8516923 0.2783449   NA 0.022165423     8.0   158  0.05891468 1.0284938
p     0.07983462  0.1547419  0.2356164  0.1576307 0.0410555   NA 0.001264152     3.1  1055 -0.00592979 0.9989195
c[1] -2.64389817 -2.0139135 -1.4006353 -2.0246384 0.3322312   NA 0.024410375     7.3   185  0.08432516 1.0213586
c[2] -1.35215148 -0.9021336 -0.4709347 -0.9031153 0.2319716   NA 0.019547231     8.4   141  0.07087327 1.0153249

Marginalize over missing discrete response data in Stan

Example

Stan model

Running the model in R

Estimating a similar model in JAGS

Answers (1)

Related Questions