Ben

Reputation: 42313

linear model diagnostics for Bayesian models using rstan

I'm looking for an efficient method to identify data points that have an outsize effect on the parameters of a linear model. This is straightforward with ordinary linear models, but I'm not sure how to do it with Bayesian linear models.

Here's one way with ordinary linear models: we can compute the Cook's distance for each data point and make diagnostic plots that include those Cook's distances:

# ordinary linear model diagnostics, similar to my use-case
library(dplyr)
library(purrr)
library(tidyr)
library(broom)
# fit a linear model for each species
xy <- 
  iris %>% 
  nest(-Species) %>% 
  mutate(model = map(data, 
                     ~lm(Sepal.Length ~ Petal.Width, 
                         data = .)),
         fit = map(model, augment)) 

Here we have a data frame with nested list-columns; the model column contains the linear model for each species:

> xy
# A tibble: 3 × 4
     Species              data    model                   fit
      <fctr>            <list>   <list>                <list>
1     setosa <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
2 versicolor <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
3  virginica <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>

The broom::augment function adds the Cook's distance value for each data point to this data frame, and we can inspect them like this:

# inspect Cook's distance values
xy %>% 
 unnest(fit) %>% 
 arrange(desc(.cooksd))

  # A tibble: 150 × 10
      Species Sepal.Length Petal.Width  .fitted    .se.fit     .resid       .hat    .sigma    .cooksd
       <fctr>        <dbl>       <dbl>    <dbl>      <dbl>      <dbl>      <dbl>     <dbl>      <dbl>
1  versicolor          5.9         1.8 6.612097 0.16181001 -0.7120969 0.13725081 0.4269862 0.24507448
2      setosa          5.0         0.6 5.335281 0.17114108 -0.3352811 0.25027563 0.3410686 0.21385214
3   virginica          4.9         1.7 6.375829 0.13613717 -1.4758292 0.04875277 0.5826838 0.15434787
4      setosa          5.7         0.4 5.149247 0.08625887  0.5507534 0.06357957 0.3355980 0.09396588
5      setosa          4.3         0.1 4.870195 0.08321347 -0.5701948 0.05916942 0.3349111 0.09285408
6   virginica          5.8         2.4 6.831411 0.14828703 -1.0314106 0.05784319 0.6035012 0.09117693
7   virginica          7.2         1.6 6.310746 0.16207266  0.8892538 0.06909799 0.6084108 0.08293253
8  versicolor          4.9         1.0 5.471005 0.11998077 -0.5710051 0.07546185 0.4328038 0.07544526
9      setosa          5.8         0.2 4.963212 0.05287342  0.8367879 0.02388828 0.3228858 0.07500610
10 versicolor          6.0         1.0 5.471005 0.11998077  0.5289949 0.07546185 0.4340307 0.06475225
# ... with 140 more rows, and 1 more variables: .std.resid <dbl>

And with the autoplot method, we can make informative diagnostic plots that show the Cook's distance values and help us quickly identify data points with an outsize effect on the models' parameters:

# plot model diagnostics
library(ggplot2)
library(ggfortify)
diagnostic_plots_df <- 
  xy %>%  
  mutate(diagnostic_plots = map(model, 
                                ~autoplot(., 
                                          which = 1:6, 
                                          ncol = 3, 
                                          label.size = 3)))

Here's just one of the plots produced:

> diagnostic_plots_df$diagnostic_plots[[1]]

(image: autoplot diagnostic plots for the first model, including the Cook's distance panels)

Now, for the Bayesian linear models, we can similarly fit a model for each group in the data frame:

# Bayesian linear model diagnostics
library(rstanarm)
bayes_xy <- 
  iris %>% 
  nest(-Species) %>% 
  mutate(model = map(data, 
                     ~stan_lm(Sepal.Length ~ Petal.Width, 
                         data = .,
                         prior = NULL, 
                         chains = 1, 
                         cores = 2, 
                         seed = 1)),
         fit =  map(model, augment))

> bayes_xy
# A tibble: 3 × 4
     Species              data         model                   fit
      <fctr>            <list>        <list>                <list>
1     setosa <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
2 versicolor <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
3  virginica <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>

But the broom::augment method doesn't have anything like a Cook's distance value:

# inspect fit diagnostics
bayes_xy %>% unnest(fit)

# A tibble: 150 × 6
   Species Sepal.Length Petal.Width  .fitted    .se.fit      .resid
    <fctr>        <dbl>       <dbl>    <dbl>      <dbl>       <dbl>
1   setosa          5.1         0.2 4.963968 0.06020298  0.13482025
2   setosa          4.9         0.2 4.963968 0.06020298 -0.06517975
3   setosa          4.7         0.2 4.963968 0.06020298 -0.26517975
4   setosa          4.6         0.2 4.963968 0.06020298 -0.36517975
5   setosa          5.0         0.2 4.963968 0.06020298  0.03482025
6   setosa          5.4         0.4 5.151501 0.11299956  0.21818386
7   setosa          4.6         0.3 5.057734 0.05951488 -0.47349794
8   setosa          5.0         0.2 4.963968 0.06020298  0.03482025
9   setosa          4.4         0.2 4.963968 0.06020298 -0.56517975
10  setosa          4.9         0.1 4.870201 0.11408783  0.04313845
# ... with 140 more rows

And no autoplot method:

# plot model diagnostics
bayes_diagnostic_plots_df <- 
  bayes_xy %>%  
  mutate(diagnostic_plots = map(model, 
                                ~autoplot(., 
                                          which = 1:6, 
                                          ncol = 3, 
                                          label.size = 3)))

# Error, there doesn't seem to be an autoplot method for stan_lm output

shinystan::launch_shinystan(bayes_xy$model[[1]])

# This is quite interesting, but nothing at the level of specific data points

Some of the scholarly literature discusses methods such as model perturbations, phi-divergence, Cook's posterior mode distance and Cook's posterior mean distance, Kullback-Leibler divergence, and so on. But I can't see anywhere that this has been explored with R code, and I'm stuck.

There is an unanswered question on this topic at Cross Validated. I'm posting here because I'm looking for ideas about writing code to compute influence statistics (rather than advice about the statistical theory and approach, etc., which should go at that other question).

How can I get something like a Cook's Distance measure from the output of rstanarm::stan_lm?

Upvotes: 5

Views: 1037

Answers (2)

Ben Goodrich

Reputation: 4990

This post by Aki Vehtari says it best:

The difference between lppd_i and loo_i has been used as a sensitivity measure (see, e.g., Gelfand et al 1992). Pareto shape parameter estimate k is likely to be large if the difference between lppd_i and loo_i is large. It's not yet clear to me whether the Pareto shape parameter estimate k would be better than lppd_i-loo_i, but at least we know that estimate for lppd_i-loo_i is too small if k is close to 1 or larger, so it might be better to look at k. In stack example with normal model, k for one observation is large, but with student-t model k is smaller. Normal model is same as student-t model, but with very strong prior on degrees of freedom. So it's not just about having strong prior or more shrinkage, but having a model which can describe the observations well. With increased shrinkage and non-robust observation model, the one observation could still be surprising. Naturally it's not always the best solution to change to a more robust observation model allowing "outliers". Instead it might be better to make the regression function more nonlinear (that is having a less strong prior), or transform covariates, or add more covariates. So I do recommend looking at Pareto shape parameter values, but I don't recommend to increase shrinkage if the values are large.
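For concreteness, here is a rough sketch (not from the quote itself) of how that lppd_i - loo_i difference could be computed for one of the models fitted in the question, assuming the bayes_xy data frame from the question is available:

# sketch: pointwise lppd_i - loo_i sensitivity measure for the setosa model
ll <- log_lik(bayes_xy$model[[1]])      # posterior draws x observations matrix
lppd_i <- log(colMeans(exp(ll)))        # pointwise log predictive density (naive average)
loo_i <- loo(bayes_xy$model[[1]])$pointwise[, "elpd_loo"]  # pointwise LOO elpd
sensitivity <- lppd_i - loo_i           # large values = observation is surprising
head(sort(sensitivity, decreasing = TRUE))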

You can obtain the Pareto shape parameter estimate k from the $pareto_k element of the list produced by the loo function in the loo package, which is reexported by the rstanarm package. If this value is higher than 0.7 (by default), the loo function will recommend that you re-fit the model leaving out this observation because the posterior distribution is likely to be too sensitive to this observation to satisfy the assumption of the LOOIC that each observation has a negligible impact on the posterior distribution.
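As a minimal sketch (again assuming the bayes_xy data frame from the question), the k estimates can be pulled out like this:

# sketch: extract the Pareto k estimates for the setosa model
loo_setosa <- loo(bayes_xy$model[[1]])
loo_setosa$pareto_k                 # one k estimate per observation
which(loo_setosa$pareto_k > 0.7)    # observations loo() would warn about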

In the OP's case, only the seventh observation has a Pareto shape parameter estimate that is slightly greater than 0.5, so that observation probably does not have an extreme effect on the posterior. But you definitely want to investigate observations that have a value greater than 1.0.

You can also call the plot method for a loo object, especially with the non-default option label_points = TRUE to visualize the Pareto shape parameter estimates.
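Continuing the sketch above:

# sketch: visualise the Pareto k estimates with observation labels
plot(loo_setosa, label_points = TRUE)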

Upvotes: 4

Ben

Reputation: 42313

Looking into some of the discussion on the stan-users email list, I saw that the output from the loo package contains 'pointwise contributions for each observation'. So here's an attempt at working with those:

# Bayesian linear model diagnostics
library(dplyr)
library(purrr)
library(tidyr)
library(rstanarm)
library(loo)
bayes_xy <- 
  iris %>% 
  nest(-Species) %>% 
  mutate(model = map(data, 
                     ~stan_lm(Sepal.Length ~ Petal.Width, 
                         data = .,
                         prior = NULL, 
                         chains = 1, 
                         cores = 2, 
                         seed = 1)))


bayes_xy_loo <- 
bayes_xy %>% 
  mutate(loo_out = map(model, ~loo(.)))

library(ggplot2)
library(ggrepel)
n <-  5 # how many points to label


my_plots <- 
bayes_xy_loo %>% 
  select(loo_out) %>% 
  mutate(loo_pointwise = map(.$loo_out, ~data.frame(.$pointwise))) %>% 
  mutate(plots = map(.$loo_pointwise, 
      ~ggplot(., 
       aes(elpd_loo,
           looic)) +
       geom_point(aes(size = p_loo)) +
       geom_text_repel(data = .[tail(order(.$looic), n),] ,
                  aes(label = row.names(.[tail(order(.$looic), n),])),
                  nudge_x = -0.1,
                  nudge_y = -0.3) +
        geom_label_repel(data = .[tail(order(.$elpd_loo), n),] ,
                        aes(label = row.names(.[tail(order(.$elpd_loo), n),])),
                        nudge_x = 0.1,
                        nudge_y = 0.1) +
       xlab("Expected log pointwise \npredictive density") +
       ylab("LOO information \ncriterion") +
       scale_size_area(name = "Estimated \neffective\nnumber \nof parameters") +
       theme_minimal(base_size = 10)))

do.call(gridExtra::grid.arrange, my_plots$plots)

(image: grid of loo pointwise plots, one per species, with the most extreme observations labelled)

However, the points suggested to be influential here do not match the classical diagnostics well. For example, in the question we have obs. 7, 15, and 30 with high Cook's distance values. In the loo output, it seems that only obs. 15 is identified as unusual. So perhaps this isn't the right way to do it.
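One way to check this correspondence more directly (a hedged sketch, assuming the xy data frame from the question, the bayes_xy_loo object from above, and the $pareto_k element described in the other answer) is to line up Cook's distance and Pareto k per observation:

# sketch: compare Cook's distance (classical fit) with Pareto k (loo), setosa model
compare_setosa <- data.frame(
  obs      = seq_len(nrow(xy$fit[[1]])),
  cooksd   = xy$fit[[1]]$.cooksd,
  pareto_k = bayes_xy_loo$loo_out[[1]]$pareto_k
)
head(compare_setosa[order(-compare_setosa$cooksd), ])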

Upvotes: 1
