Reputation: 294
I want to visualize some assumptions from regression theory. The starting point is this population and the linear regression fitted to it:
set.seed(1234)
#Draw n values from a discrete uniform distribution on min:max
runifdisc <- function(n, min = 0, max = 1) sample(min:max, n, replace = TRUE)
x1 <- runifdisc(100000, 1, 10)  #predictor: integers 1 to 10
e <- runifdisc(100000, 0, 20)   #error term: integers 0 to 20, mean 10
y <- 4 + 2 * x1 + e
dat_pop <- data.frame(x1, e, y)
m_pop <- lm(y ~ x1, data = dat_pop)
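Because e has a mean of 10, the population regression line is approximately E(y | x1) = 14 + 2 * x1, which m_pop should recover:
#Intercept close to 4 + mean(e) = 14, slope close to 2
coef(m_pop)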
Since x1 takes 10 values, m_pop yields 10 different predictions for y given x1:
> table(round(predict(m_pop)))
16 18 20 22 24 26 28 30 32 34
10096 10081 9864 10078 9927 9914 9915 10124 10018 9983
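As a quick side check: in the population itself, the error variance is roughly the same for every value of x1, and so is the variance of the residuals of m_pop. A minimal check with base R (all values should be close to var(e), about 36.7):
#Variance of the error term for each value of x1
tapply(dat_pop$e, dat_pop$x1, var)
#Variance of the population residuals for each value of x1
tapply(resid(m_pop), dat_pop$x1, var)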
If I am not mistaken, the predictions of y given x1 should have the same variance for each specific value of x1 across a large number of samples; the same assumption should also apply to the residuals. However, in my code, the variance varies across the different predictions of y given x1:
n <- 1000
vec_y1 <- rep(0, n)
vec_y2 <- rep(0, n)
vec_y3 <- rep(0, n)
vec_y4 <- rep(0, n)
vec_y5 <- rep(0, n)
vec_y6 <- rep(0, n)
vec_y7 <- rep(0, n)
vec_y8 <- rep(0, n)
vec_y9 <- rep(0, n)
vec_y10 <- rep(0, n)
#Draw 1000 samples from dat_pop; in each sample, save the prediction of y given x1
#in vectors vec_y1-vec_y10.
for (i in 1:n){
  s <- dat_pop[sample(nrow(dat_pop), 1000), ]
  m <- lm(y ~ x1, data = s)
  #Prediction for y given x1 == 1
  vec_y1[i] <- m$coefficients[1] + 1 * m$coefficients[2]
  #Prediction for y given x1 == 2
  vec_y2[i] <- m$coefficients[1] + 2 * m$coefficients[2]
  #Prediction for y given x1 == 3
  vec_y3[i] <- m$coefficients[1] + 3 * m$coefficients[2]
  #Prediction for y given x1 == ...
  vec_y4[i] <- m$coefficients[1] + 4 * m$coefficients[2]
  vec_y5[i] <- m$coefficients[1] + 5 * m$coefficients[2]
  vec_y6[i] <- m$coefficients[1] + 6 * m$coefficients[2]
  vec_y7[i] <- m$coefficients[1] + 7 * m$coefficients[2]
  vec_y8[i] <- m$coefficients[1] + 8 * m$coefficients[2]
  vec_y9[i] <- m$coefficients[1] + 9 * m$coefficients[2]
  vec_y10[i] <- m$coefficients[1] + 10 * m$coefficients[2]
}
#Variance of different predictions for y given x1 in the samples above.
#This variance should be equal for all vectors vec_y1-vec_y10.
var(vec_y1)
var(vec_y3)
var(vec_y5)
var(vec_y8)
var(vec_y10)
The variance is larger for the lower and upper values of x1.
> var(vec_y1)
[1] 0.1234933
> var(vec_y3)
[1] 0.06295427
> var(vec_y5)
[1] 0.03637214
> var(vec_y8)
[1] 0.06016804
> var(vec_y10)
[1] 0.118478
My question is twofold. On the one hand, it concerns my understanding of the assumption from regression theory; perhaps there is a misunderstanding on my side. On the other hand, it asks for code that produces the same variance for all predictions of y given x1.
Upvotes: 0
Views: 126
Reputation: 389
I think this will help to solve it. It took me a while to see it...
The variance is smallest around 5 because that is near the mean of your x1 values (x1 spans 1 to 10, so its mean is 5.5). If you change the model to, e.g., 1:20, then the variance will be minimal around 10.
You fit the model n times in the for loop. Each time the estimated slope changes a bit, and a change in slope moves the predictions most at the ends of the range (1 and 10). That is the reason behind it: the fitted regression line always passes through the point of means, so predictions near the center vary least.
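In formula terms (a sketch using the standard expression for the sampling variance of a fitted value): Var(yhat(x0)) = sigma^2 * (1/n + (x0 - xbar)^2 / Sxx), which is minimal at xbar and grows quadratically toward both ends. Plugging in the population quantities reproduces the pattern of the simulated variances (the variable names below are mine):
sigma2 <- var(dat_pop$e)            #error variance, about 36.7
n_s <- 1000                         #sample size used in each iteration
xbar <- mean(dat_pop$x1)            #about 5.5
sxx <- (n_s - 1) * var(dat_pop$x1)  #expected sum of squares of x1 in one sample
x0 <- 1:10
#Theoretical variance of the fitted value at x0 = 1, ..., 10;
#smallest near 5 and 6, largest at 1 and 10, close to the simulated values
sigma2 * (1/n_s + (x0 - xbar)^2 / sxx)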
Below are two test samples (two iterations from the loop).
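A minimal sketch of how two such test fits could be drawn and compared (the sample and model names are mine):
s1 <- dat_pop[sample(nrow(dat_pop), 1000), ]
s2 <- dat_pop[sample(nrow(dat_pop), 1000), ]
m1 <- lm(y ~ x1, data = s1)
m2 <- lm(y ~ x1, data = s2)
#Plot one sample and overlay both fitted lines: they diverge most
#at x1 = 1 and x1 = 10 and cross near the point of means
plot(y ~ x1, data = s1, col = "grey80", pch = 16, cex = 0.4)
abline(m1, col = "red", lwd = 2)
abline(m2, col = "blue", lwd = 2)
points(mean(dat_pop$x1), mean(dat_pop$y), pch = 19)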
Upvotes: 1