Reputation: 294
I want to visualize some assumptions from regression theory. The starting point is this population and the linear regression fitted to it:
set.seed(1234)
#Draw n values from a discrete uniform distribution on min:max
runifdisc <- function(n, min = 0, max = 1) sample(min:max, n, replace = TRUE)
x1 <- runifdisc(100000, 1, 10)  #predictor: integers 1 to 10
e <- runifdisc(100000, 0, 20)   #error term: integers 0 to 20, mean 10
y <- 4 + 2 * x1 + e
dat_pop <- data.frame(x1, e, y)
m_pop <- lm(y ~ x1, data = dat_pop)
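Because e has a mean of 10, the population regression line is approximately E(y | x1) = 14 + 2 * x1, which m_pop should recover:
#Intercept close to 4 + mean(e) = 14, slope close to 2
coef(m_pop)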
Since x1 takes 10 values, m_pop yields 10 different predictions for y given x1:
> table(round(predict(m_pop)))
16 18 20 22 24 26 28 30 32 34
10096 10081 9864 10078 9927 9914 9915 10124 10018 9983
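As a quick side check: in the population itself, the error variance is roughly the same for every value of x1, and so is the variance of the residuals of m_pop. A minimal check with base R (all values should be close to var(e), about 36.7):
#Variance of the error term for each value of x1
tapply(dat_pop$e, dat_pop$x1, var)
#Variance of the population residuals for each value of x1
tapply(resid(m_pop), dat_pop$x1, var)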
If I am not mistaken, the predictions of y given x1 should have the same variance for each specific value of x1 across a large number of samples; the same assumption should also apply to the residuals. However, in my code, the variance varies across the different predictions of y given x1:
n <- 1000
vec_y1 <- rep(0, n)
vec_y2 <- rep(0, n)
vec_y3 <- rep(0, n)
vec_y4 <- rep(0, n)
vec_y5 <- rep(0, n)
vec_y6 <- rep(0, n)
vec_y7 <- rep(0, n)
vec_y8 <- rep(0, n)
vec_y9 <- rep(0, n)
vec_y10 <- rep(0, n)
#Draw 1000 samples from dat_pop; in each sample, save the prediction of y given x1
#in vectors vec_y1-vec_y10.
for (i in 1:n){
  s <- dat_pop[sample(nrow(dat_pop), 1000), ]
  m <- lm(y ~ x1, data = s)
  #Prediction for y given x1 == 1
  vec_y1[i] <- m$coefficients[1] + 1 * m$coefficients[2]
  #Prediction for y given x1 == 2
  vec_y2[i] <- m$coefficients[1] + 2 * m$coefficients[2]
  #Prediction for y given x1 == 3
  vec_y3[i] <- m$coefficients[1] + 3 * m$coefficients[2]
  #Prediction for y given x1 == ...
  vec_y4[i] <- m$coefficients[1] + 4 * m$coefficients[2]
  vec_y5[i] <- m$coefficients[1] + 5 * m$coefficients[2]
  vec_y6[i] <- m$coefficients[1] + 6 * m$coefficients[2]
  vec_y7[i] <- m$coefficients[1] + 7 * m$coefficients[2]
  vec_y8[i] <- m$coefficients[1] + 8 * m$coefficients[2]
  vec_y9[i] <- m$coefficients[1] + 9 * m$coefficients[2]
  vec_y10[i] <- m$coefficients[1] + 10 * m$coefficients[2]
}
#Variance of different predictions for y given x1 in the samples above.
#This variance should be equal for all vectors vec_y1-vec_y10.
var(vec_y1)
var(vec_y3)
var(vec_y5)
var(vec_y8)
var(vec_y10)
The variance is larger for the lower and upper values of x1.
> var(vec_y1)
[1] 0.1234933
> var(vec_y3)
[1] 0.06295427
> var(vec_y5)
[1] 0.03637214
> var(vec_y8)
[1] 0.06016804
> var(vec_y10)
[1] 0.118478
My question is twofold. On the one hand, it concerns my understanding of the assumption from regression theory; perhaps there is a misunderstanding on my side. On the other hand, it asks for code that produces the same variance for all predictions of y given x1.
Upvotes: 0
Views: 126
Reputation: 389
I think this will help to solve it. It took me a while to see it...
The variance is smallest around 5 because that is near the mean of your x1 values (x1 spans 1 to 10, so its mean is 5.5). If you change the model to, e.g., 1:20, then the variance will be minimal around 10.
You fit the model n times in the for loop. Each time the estimated slope changes a bit, and a change in slope moves the predictions most at the ends of the range (1 and 10). That is the reason behind it: the fitted regression line always passes through the point of means, so predictions near the center vary least.
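In formula terms (a sketch using the standard expression for the sampling variance of a fitted value): Var(yhat(x0)) = sigma^2 * (1/n + (x0 - xbar)^2 / Sxx), which is minimal at xbar and grows quadratically toward both ends. Plugging in the population quantities reproduces the pattern of the simulated variances (the variable names below are mine):
sigma2 <- var(dat_pop$e)            #error variance, about 36.7
n_s <- 1000                         #sample size used in each iteration
xbar <- mean(dat_pop$x1)            #about 5.5
sxx <- (n_s - 1) * var(dat_pop$x1)  #expected sum of squares of x1 in one sample
x0 <- 1:10
#Theoretical variance of the fitted value at x0 = 1, ..., 10;
#smallest near 5 and 6, largest at 1 and 10, close to the simulated values
sigma2 * (1/n_s + (x0 - xbar)^2 / sxx)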
Below are two test samples (two iterations from the loop).
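A minimal sketch of how two such test fits could be drawn and compared (the sample and model names are mine):
s1 <- dat_pop[sample(nrow(dat_pop), 1000), ]
s2 <- dat_pop[sample(nrow(dat_pop), 1000), ]
m1 <- lm(y ~ x1, data = s1)
m2 <- lm(y ~ x1, data = s2)
#Plot one sample and overlay both fitted lines: they diverge most
#at x1 = 1 and x1 = 10 and cross near the point of means
plot(y ~ x1, data = s1, col = "grey80", pch = 16, cex = 0.4)
abline(m1, col = "red", lwd = 2)
abline(m2, col = "blue", lwd = 2)
points(mean(dat_pop$x1), mean(dat_pop$y), pch = 19)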
Upvotes: 1