Reputation: 151
I have some ordinal data with missingness, which I am trying to model in Stan. Since Stan cannot handle discrete parameters directly, I am attempting to marginalize over the different possible values of the response variable for those cases which are missing.
Intuitively, I believe that for each missing value I need the probability of it being a given ordinal outcome, multiplied by the probability of that outcome, plus the probability of it not being that outcome, multiplied by 1 minus the probability of that outcome.
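In symbols, since I want to marginalize over the unknown category of each missing $y_n$ (with category probabilities $\theta_{n,1}, \dots, \theta_{n,K}$), I think the likelihood contribution of a missing $y_n$ should come from summing over its possible values,

$$\sum_{k=1}^{K} \Pr(y_n = k \mid \theta_n) \;=\; \sum_{k=1}^{K} \theta_{n,k},$$

but I am not certain that the log_mix() construction in my code below actually computes this.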
However, in practice I am uncertain whether I am coding the model correctly in Stan and whether my mathematical intuition for marginalizing over the missing data is right. Although the Stan manual has examples of dealing with discrete parameters, I'm still a bit lost. I would also like to make predictions, or obtain an imputed vector of y values, in the Stan generated quantities block, but I could do with some pointers on getting started. The problem is not getting the code running; it is more about sanity-checking that I have done it correctly.
As an example, consider the wine data from the R package ordinal. The R code below takes the rating variable, transforms it to a 3-category scale for ease of explication, induces 15% missingness in the variable new_rating_w_miss, and then creates an indicator vector is_missing that takes a 1 if the value is missing and a 0 otherwise. Since Stan won't take NA values, I have given the missing values a value of -1.
library(ordinal)
library(rstan)
library(runjags)
library(rjags)

data(wine)

# Collapse the rating scale to 3 categories
wine$new_rating <- ifelse(wine$rating > 3, 3, wine$rating)
wine$new_rating_w_miss <- ifelse(wine$rating > 3, 3, wine$rating)
wine$temp_num <- ifelse(wine$temp == "cold", 1, 0)

# Induce roughly 15% missingness, coded as -1 (Stan won't accept NA),
# and record a 0/1 missingness indicator
set.seed(2017)
n_missing <- sample(1:nrow(wine), round(nrow(wine) * 0.15), replace = FALSE)
wine$new_rating_w_miss[n_missing] <- -1
wine$is_missing <- ifelse(wine$new_rating_w_miss == -1, 1, 0)
My Stan model code is as follows:
data {
  int K;                      // number of ordinal categories
  int<lower=0> N;
  int<lower=1> D;
  int y[N];                   // ratings, with -1 for missing
  int<lower=0> missing[N];    // 1 = missing, 0 = observed
  int<lower=0> N_miss;
  row_vector[D] x[N];
}
parameters {
  vector[D] beta;
  real<lower=0,upper=1> p;
  ordered[K-1] c;
}
model {
  vector[K] theta;
  vector[N_miss] lp;
  beta ~ normal(0, 5);
  p ~ beta(2, 2);
  for (n in 1:N) {
    real eta;
    eta = x[n] * beta;
    // ordered-probit category probabilities
    theta[1] = Phi( (c[1] - eta) );
    for (k in 2:(K-1))
      theta[k] = Phi( (c[k] - eta) ) - Phi( (c[k-1] - eta) );
    theta[K] = 1 - Phi( (c[K-1] - eta) );
    if (missing[n] == 0) {
      y[n] ~ categorical(theta);
      missing[n] ~ bernoulli(p);
    }
    if (missing[n] == 1) {
      // attempt to marginalize over the unknown rating via log_mix()
      target += log_mix( theta[1],
                         categorical_lpmf( 1 | theta ),
                         categorical_lpmf( (2 || 3) | theta ) )
              + log_mix( theta[2],
                         categorical_lpmf( 2 | theta ),
                         categorical_lpmf( (1 || 3) | theta ) )
              + log_mix( theta[3],
                         categorical_lpmf( 3 | theta ),
                         categorical_lpmf( (1 || 2) | theta ) );
      missing[n] ~ bernoulli(p);
    }
  }
}
As can be seen, I have attempted to marginalize over the discrete unknown values by using the log_mix() function in Stan: specifically, adding up the (log) probabilities of each missing value being 1, 2, or 3, given the (log) probability of those values estimated from the observed data, together with the counter-(log) probability.
One question is: should I be estimating separate theta vectors of probabilities from the observed and the unobserved data?
The results:
Inference for Stan model: Stan_missing_model.
4 chains, each with iter=500; warmup=250; thin=1;
post-warmup draws per chain=250, total post-warmup draws=1000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[1] -1.28 0.01 0.30 -1.83 -1.48 -1.28 -1.09 -0.67 808 1.00
p 0.16 0.00 0.04 0.09 0.13 0.15 0.18 0.24 812 1.00
c[1] -1.26 0.01 0.23 -1.70 -1.43 -1.26 -1.11 -0.81 699 1.00
c[2] -0.72 0.01 0.21 -1.15 -0.86 -0.70 -0.57 -0.27 746 1.00
lp__ -119.80 0.08 1.47 -123.45 -120.57 -119.34 -118.71 -118.02 378 1.02
Running the model in R with the wine data above:
stan_data <- list( K = 3, N = nrow(wine), D = 1,
                   y = wine$new_rating_w_miss,
                   missing = wine$is_missing,
                   N_miss = sum(wine$is_missing),
                   x = as.matrix(wine$temp_num) )

fit_stan <- stan(file = "Stan_missing_model.stan", data = stan_data,
                 cores = 4, chains = 4, warmup = 250, iter = 500)
print(fit_stan)
A similar model in JAGS, which automatically imputes NA values, seems to arrive at not-too-dissimilar parameter estimates:
model{
  for (i in 1:N) {
    y[i] ~ dcat( theta[i, 1:3] )
    theta[i, 1] <- phi( c[1] - mu[i] )
    theta[i, 2] <- phi( c[2] - mu[i] ) - phi( c[1] - mu[i] )
    theta[i, 3] <- 1 - phi( c[2] - mu[i] )
    mu[i] <- beta * temp[i]
    miss[i] ~ dbern( p )
  }
  beta ~ dnorm(0, 5)
  p ~ dbeta(2, 2)
  for (i in 1:2) {
    c_raw[i] ~ dnorm(0, 1.0E-3)
  }
  c <- sort(c_raw)
}
jags_data <- list( N = nrow(wine),
                   y = ifelse(wine$new_rating_w_miss == -1,
                              NA, wine$new_rating_w_miss),
                   temp = wine$temp_num,
                   miss = wine$is_missing )

jags_monitor <- c("beta", "p", "c")

fit_jags <- run.jags(model = "JAGS_so_model.txt",
                     data = jags_data, monitor = jags_monitor,
                     inits = list("c_raw" = c(-0.5, 0.5)),
                     n.chains = 4, burnin = 500, sample = 500)
summary(fit_jags)
Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
beta -1.40131934 -0.8531391 -0.3148070 -0.8516923 0.2783449 NA 0.022165423 8.0 158 0.05891468 1.0284938
p 0.07983462 0.1547419 0.2356164 0.1576307 0.0410555 NA 0.001264152 3.1 1055 -0.00592979 0.9989195
c[1] -2.64389817 -2.0139135 -1.4006353 -2.0246384 0.3322312 NA 0.024410375 7.3 185 0.08432516 1.0213586
c[2] -1.35215148 -0.9021336 -0.4709347 -0.9031153 0.2319716 NA 0.019547231 8.4 141 0.07087327 1.0153249
Upvotes: 3
Views: 1212
Reputation: 3753
You need to code something equal to the log posterior plus a constant. That's usually the log joint for the model.
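For the model in the question, writing $m_n$ for the missingness indicator and $\theta_{n,k}$ for the category probabilities implied by $\beta$ and $c$, that log joint is, up to a constant, roughly

$$\log p(\beta) + \log p(p) + \sum_{n:\,m_n=0}\log\theta_{n,y_n} + \sum_{n=1}^{N}\log\mathrm{Bernoulli}(m_n\mid p) + \sum_{n:\,m_n=1}\log\sum_{k=1}^{K}\theta_{n,k},$$

where the last sum is the marginalization over the unknown categories.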
Missing continuous data is typically modeled using a parameter---see the manual chapter on missing data.
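For reference, a minimal sketch of that continuous-data pattern (essentially the manual's example; the names here are illustrative):

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real y_mis[N_mis];   // missing observations treated as parameters
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);
}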
The || operation is logical or, so categorical_lpmf((2 || 3) | theta) is going to be equal to categorical_lpmf(1 | theta), because (2 || 3) evaluates to 1 (any non-zero value is treated as logically true and 1 encodes the true value).
If you have missing discrete data x that can take on values 1:K, with log density/mass lp[k] when x == k, then the way to code it is just target += log_sum_exp(lp). You need to make sure to include the normalizing term in the lp[k].
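Applied to your model, that branch could look something like this (a sketch, assuming theta has been filled in for observation n exactly as in your loop; categorical_lpmf is already normalized, so it supplies the normalizing term):

if (missing[n] == 1) {
  vector[K] lp_mis;
  for (k in 1:K)
    lp_mis[k] = categorical_lpmf(k | theta);  // log theta[k], normalized
  target += log_sum_exp(lp_mis);              // marginalize over the unknown category
  missing[n] ~ bernoulli(p);
}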
You might want to work through everything and see whether the missing data has any impact on the posterior; it often just factors out. (Here, summing the category probabilities over all K values gives 1, so each missing y[n] contributes only a constant as long as the missingness indicator does not depend on the missing value.)
You don't estimate different parameters for observed and unobserved data, so you do not need separate theta vectors.
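For the imputed y vector mentioned in the question, one possible sketch of a generated quantities block (reusing the same data and parameters as the posted model, and drawing each missing category from its implied probabilities) would be:

generated quantities {
  int y_imp[N];
  for (n in 1:N) {
    vector[K] theta;
    real eta;
    eta = x[n] * beta;
    theta[1] = Phi( (c[1] - eta) );
    for (k in 2:(K-1))
      theta[k] = Phi( (c[k] - eta) ) - Phi( (c[k-1] - eta) );
    theta[K] = 1 - Phi( (c[K-1] - eta) );
    if (missing[n] == 1)
      y_imp[n] = categorical_rng(theta);  // posterior draw for the missing category
    else
      y_imp[n] = y[n];                    // keep the observed value
  }
}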
Upvotes: 3