emehex

Reputation: 10558

Simulating draws/data from a fitted model with factor/character level inputs

I have data that looks like this:

library(tidyverse)
set.seed(2017)

df <- tibble(
    product = c(rep("A", 50), rep("B", 50)),
    sales = round(c(rnorm(50, mean = 55, sd = 10), rnorm(50, mean = 60, sd = 15)))
)

I can build a linear regression on the data:

mod1 <- lm(sales ~ product, data = df)

And predict the sales from products "A" and "B":

predict(mod1, tibble(product = c("A", "B")))

>     1     2 
> 55.78 58.96 

But I want to simulate draws from the fitted model instead of just predicting the fitted values. I want draws so that I can capture the uncertainty around the point estimates (without having to juggle SDs, CIs, etc.).

I would normally use simulate() and tweak model_object$fitted.values, but I can't do that here because the inputs to my model are factor/character levels ("A" and "B").
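(For reference, this is the kind of one-liner I mean; simulate() draws whole response vectors from the fitted model, i.e. fitted values plus Gaussian noise at the residual SD:)

# 100 simulated sales vectors from the fitted model,
# returned as a data frame with one column per draw
sims <- simulate(mod1, nsim = 100)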

I can get the shape of the distributions:

a_mu <- coef(summary(mod1))["(Intercept)", "Estimate"] 
a_se <- coef(summary(mod1))["(Intercept)", "Std. Error"] 

b_mu <- coef(summary(mod1))["productB", "Estimate"] 
b_se <- coef(summary(mod1))["productB", "Std. Error"] 

And simulate draws like this:

N <- 100

product_A <- replicate(N,
    rnorm(n = 1, mean = a_mu, sd = a_se) + rnorm(n = 1, mean = b_mu, sd = b_se) * 0)

product_B <- replicate(N,
    rnorm(n = 1, mean = a_mu, sd = a_se) + rnorm(n = 1, mean = b_mu, sd = b_se) * 1)

And stuff it all in a tibble for visualization:

pred <- tibble(A = product_A, B = product_B)

But this process seems super janky, and it won't scale if my data grows to, say, 5 input variables with 10 factor levels each. So, how can I make this generalizable?


I'd prefer to stay in base R and/or the tidyverse. And yes, I know that I'm flirting with Bayesian Statistics here and that I perhaps could use Stan to draw from the posterior... but that's not the point.

Upvotes: 3

Views: 1257

Answers (4)

eipi10

Reputation: 93871

Gelman and Hill (2007)^1 provide a Bayesian-flavored function for estimating uncertainty in a frequentist regression using simulation. The function is described starting at the bottom of page 142 of their (IMHO excellent) text, which can be viewed on Google Books.

The function is called sim and is available from the arm package (which is the package that accompanies Gelman and Hill's text). It uses the model parameters (including taking account of the covariance and standard errors of the coefficients) to simulate from the joint distribution of the coefficients. The function has changed since the book was published and now returns an S4 object which is accessed using slots, so the actual implementation is slightly different than described in the book.
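For intuition, sim's coefficient draws are roughly equivalent to sampling from a multivariate normal centered at the estimates with the coefficient covariance matrix. Here's a rough hand-rolled sketch (it ignores the residual-variance draw that sim also performs before drawing the coefficients):

# Approximate hand-rolled version of sim()'s coefficient draws:
# multivariate normal centered at the estimates, with covariance
# taken from the fitted model
coef.sims <- MASS::mvrnorm(1000, mu = coef(mod1), Sigma = vcov(mod1))
head(coef.sims)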

Here's an example using your data:

library(ggplot2)
library(ggbeeswarm)
theme_set(theme_classic())
library(arm)

First, we'll generate 1000 simulations of the model coefficients using the sim function:

sim.mod = sim(mod1, 1000)

The coefficients for each simulation can be found in sim.mod@coef, which is a matrix. Here are the first four lines:

sim.mod@coef[1:4,]
     (Intercept)  productB
[1,]    55.25320 3.5782625
[2,]    59.90534 0.4608387
[3,]    55.79126 5.1872595
[4,]    57.97446 1.0012827

Now let's extract the coefficient simulations, convert them to a data frame, and shorten the column names. This will give us a data frame sc with one column for the simulated intercepts and one for the simulated coefficients of the dummy variable for product=="B":

sc = setNames(as.data.frame(sim.mod@coef), c("Int","prodB"))

From here, you can use the simulations to assess uncertainty in and likely ranges for the coefficients and the predicted sales. Below are some visualizations.
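First, though, a quick numerical summary: simulation-based 95% intervals for the two coefficients are a one-liner:

# 2.5% and 97.5% quantiles of the simulated intercept and productB effect
apply(sc, 2, quantile, probs = c(0.025, 0.975))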

Let's plot a regression line in blue for each simulated pair of coefficients. We'll get 1,000 lines, and the density of the lines will show us the most likely combinations of coefficients. We also show the fitted regression line in yellow and the underlying data points in red. Obviously the lines are only meaningful at the A and B points on the x-axis. This is similar to how Gelman and Hill show the simulation results in their book.

ggplot() + 
  geom_abline(data=sc, aes(slope=prodB, intercept=Int), colour="blue", alpha=0.03) +
  geom_beeswarm(data=df, aes(product, sales), alpha=1, colour="red", size=0.7) +
  geom_abline(slope=coef(mod1)[2], intercept=coef(mod1)[1], colour="yellow", size=0.8)

[plot: 1,000 simulated regression lines (blue) over the data (red), with the fitted line (yellow)]

Another option is to calculate the predicted mean sales for each product for each pair of simulated coefficients. We do that below and plot the results as a violin plot. In addition, we mark the range from the 2.5% to 97.5% quantiles of average sales with red error bars and label the mean prediction in blue:

pd = data.frame(product=rep(c("A","B"), each=1000), sc)
pd$sales = ifelse(pd$product=="A", pd$Int, pd$Int + pd$prodB)

ggplot(pd, aes(product, sales)) + 
  geom_violin() +
  stat_summary(fun.data=median_hilow, colour="red", geom="errorbar", width=0.05, size=0.8, alpha=0.6) +
  stat_summary(fun.y=mean, aes(label=round(..y..,1)), geom="text", size=4, colour="blue")

[plot: violin plots of simulated average sales for products A and B, with 95% interval bars and mean labels]

Finally, we plot the distribution of the simulated coefficient values with 50% and 95% ellipses. coord_equal() ensures that one unit covers the same physical distance on the horizontal and vertical axes. The intercept (horizontal axis) is the predicted value of sales when product=="A". The slope (vertical axis) is the predicted difference in sales (relative to product=="A") when product=="B":

ggplot(sc, aes(Int, prodB)) +
  geom_point(alpha=0.5, colour="red", size=1) +
  stat_ellipse(level=c(0.5), colour="blue") +
  stat_ellipse(level=c(0.95), colour="blue") +
  coord_equal() +
  scale_x_continuous(breaks=seq(50,62,2)) +
  scale_y_continuous(breaks=seq(-6,12,2))

[plot: scatter of simulated coefficient pairs with 50% and 95% ellipses]

Visualization will be more complex if you have multiple variables, but the principles are similar to the single-predictor case illustrated above. The sim function will work with multiple predictor variables and categorical variables with multiple levels, so this approach should scale to more complex data sets.
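For instance, the prediction step generalizes through the model matrix. Here's a sketch with hypothetical new data (any combination of inputs would work the same way):

# Build the model matrix for the inputs you want predictions at,
# then multiply by the simulated coefficients
newdat <- data.frame(product = c("A", "B"))
X.new <- model.matrix(~ product, data = newdat)
pred.sims <- sim.mod@coef %*% t(X.new)   # one column per row of newdat
apply(pred.sims, 2, quantile, probs = c(0.025, 0.5, 0.975))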

  1. A. Gelman and J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press (2007).

Upvotes: 5

A. Webb

Reputation: 26446

Well, let's take a look at Bayesian linear regression, do some sampling, and then compare to frequentist confidence intervals. We'll follow the notation of the linked Wikipedia page, working with the posterior distribution.
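Concretely, with $\Lambda_0 = 0$ (an improper flat prior on the coefficients, and matching the code's choice of $a_n = v/2$), the posterior quantities we'll compute below reduce to

$$\mu_n = \hat\beta = (X^\top X)^{-1}X^\top y, \qquad \Lambda_n = X^\top X,$$
$$a_n = \frac{n-k}{2}, \qquad b_n = \frac{1}{2}\left(y^\top y - \mu_n^\top \Lambda_n \mu_n\right),$$

so that $\sigma^2 \mid y \sim \text{Inv-Gamma}(a_n, b_n)$ and $\beta \mid \sigma^2, y \sim N(\mu_n, \sigma^2 \Lambda_n^{-1})$.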

Setting up the Bayesian posterior simulation

X <- model.matrix(sales~product,data=df)
n <- nrow(X)
k <- ncol(X)
v <- n - k

y <- df$sales

# Take Lambda_0 <- 0 so these simplify
beta.hat <- solve(crossprod(X),crossprod(X,y))
S <- solve(crossprod(X)) # Lambda_n^-1
mu_n <- beta.hat
a_n <- v/2 # I think this is supposed to be v instead of n, the factor with k was dropped?
b_n <- (crossprod(y) - crossprod(mu_n, crossprod(X) %*% mu_n))/2

Running the simulation

Now we draw sigma^2 from an inverse-gamma(a_n, b_n):

N <- 10000
sigma.2 <- 1/rgamma(N, a_n, b_n)

And draw beta from a normal(mu_n, S * sigma.2), using the sigma.2 just generated:

require("MASS")
beta <- sapply(sigma.2, function(s) MASS::mvrnorm(1, mu_n, S * s))

We'll put this all into a data.frame

sim <- data.frame(t(beta),sigma=sqrt(sigma.2))

Comparing to lm output

The stats

t(sapply(sim,function(x) c(mean=mean(x),sd=sd(x))))

                  mean        sd
X.Intercept. 55.786709 1.9585332
productB      3.159653 2.7552841
sigma        13.736069 0.9932521

These compare closely to the output of lm:

mod1 <- lm(sales ~ product, data = df)
summary(mod1)

...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   55.780      1.929  28.922   <2e-16 ***
productB       3.180      2.728   1.166    0.246    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.64 on 98 degrees of freedom
...

Though you'll note the Bayesian approach tends to be more conservative (larger standard errors).

Simulated points

Simulated predictions for the A and B factor levels are:

d <- factor(unique(df$product))  # contrasts() needs a factor
mm <- cbind(1, contrasts(d))
sim.y <- crossprod(beta,t(mm))

head(sim.y)
            A        B
[1,] 56.09510 61.45903
[2,] 55.43892 57.87281
[3,] 58.49551 60.59784
[4,] 52.55529 62.71117
[5,] 62.18198 59.27573
[6,] 59.50713 57.39560

Comparison to confidence intervals from lm.

We can calculate quantiles of our simulated values for A and B:

t(apply(sim.y,2,function(col) quantile(col,c(0.025,0.975))))

      2.5%    97.5%
A 51.90695 59.62353
B 55.14255 62.78334

And compare them to the confidence intervals from the frequentist linear regression:

predict(mod1, data.frame(product = c("A", "B")), interval="confidence",level=0.95)

    fit      lwr      upr
1 55.78 51.95266 59.60734
2 58.96 55.13266 62.78734

Data

# For those without tidyverse, this will suffice
if (!require("tidyverse")) tibble <- data.frame

set.seed(2017)

df <- tibble(
    product = c(rep("A", 50), rep("B", 50)),
    sales = round(c(rnorm(50, mean = 55, sd = 10), rnorm(50, mean = 60, sd = 15)))
)

Upvotes: 2

Jon Nagra

Reputation: 1660

I believe that if you want to show uncertainty around your forecasts, Bayesian regression is better suited than the traditional approach.

That said, you can get what you want in the following way (you will have to rename the columns of SimulatedMat; a sketch of that step follows the code):

# All the possible combinations of factors
modmat <- unique(model.matrix(sales ~ ., df))
# Number of simulations
simulations <- 100L
# Initialise result matrix
SimulatedMat <- matrix(0, nrow = simulations, ncol = 0)

# Iterate over all combinations of factors
for (i in 1:nrow(modmat)) {
  # columns with value one
  selcols <- which(modmat[i, ] == 1)
  # draw from each selected coefficient and sum the draws
  # (SIMPLIFY = FALSE keeps a list even when only one column is
  # selected, so the sum also works for intercept-only rows)
  simul <- Reduce(`+`, mapply(rnorm, n = simulations,
                              coef(summary(mod1))[selcols, "Estimate"],
                              coef(summary(mod1))[selcols, "Std. Error"],
                              SIMPLIFY = FALSE))
  # incorporate result into the matrix
  SimulatedMat <- cbind(SimulatedMat, simul)
}
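For this single-factor example, the renaming step mentioned above could simply be (row i of modmat produced column i of SimulatedMat):

# Here the two unique rows of modmat correspond to products "A" and "B"
colnames(SimulatedMat) <- c("A", "B")
head(SimulatedMat)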

Upvotes: 0

gsun

Reputation: 427

For uncertainty around point estimates: 1) if you choose simulation, I would recommend a boxplot; 2) if you choose CIs, you can calculate them manually, or use predict() as in A. Webb's answer, and plot the intervals. Here I just show how to do the simulation in a generalized form. You are almost there, so I hope this helps.

myfactor_pred <- function(factor, N) {
    if (factor == 0) {
        return(rnorm(N, coef(summary(mod1))[1, 1], coef(summary(mod1))[1, 2]))
    } else {
        return(rnorm(N, coef(summary(mod1))[1, 1], coef(summary(mod1))[1, 2]) +
               rnorm(N, coef(summary(mod1))[2, 1], coef(summary(mod1))[2, 2]))
    }
}
A <- myfactor_pred(0, 100)  # simulated draws for product A
B <- myfactor_pred(1, 100)  # simulated draws for product B
boxplot(data.frame(A, B), xlab = "product", ylab = "sales")

[plot: boxplots of simulated sales for products A and B]

Upvotes: 0
