Reputation: 1673
Based on this topic, I have created a function that returns a dataset with variables related to the outcome (y
) by specific linear coefs.
simulate_data_regression <- function(sample=10, coefs=0, error=0){
n_var <- length(coefs)
X <- matrix(0, ncol=n_var, nrow=sample)
beta <- as.matrix(coefs)
for (i in 1:n_var){
X[,i] <- scale(rnorm(sample, 0, 1))
}
y <- X %*% beta
if(error != 0){
y <- y + rnorm(sample, 0, error)
}
data = data.frame(X=X)
names(data) <- paste0("V", 1:n_var)
data$y <- as.vector(y)
return(data)
}
data <- simulate_data_regression(sample=50, coefs=c(0.1, 0.8), error=0)
summary(data)
sd(data$V1)
sd(data$y)
It works great. However, I would need to have a standardized y
(mean 0 and SD 1). But when I try to scale it, the coefficients change:
data <- simulate_data_regression(sample=50, coefs=c(0.1, 0.8), error=0)
data$y <- as.vector(scale(data$y))
coef(lm(y ~ ., data=data))
It is possible to do such thing? Thank you very much!
In other words, I would like the coefs that are specified to be standardized coefs (expressed in outcome's SD).
Scaling y
a posteriori changes the coefs by 1/sd(y)
. However, I can't think of any way to change the betas before generating y, so that the betas return to their specified value after the scaling of y
.
I've tried running the function twice, first extracting sd(y)
and scaling the coefficients with it, in the hope that those scaled coefficients will change to the specified ones once I'll scale y
. But it doens't work, which is expected, as sd(y)
changes when I change the coefs :'(
Here's the failed attempt:
simulate_data_regression <- function(sample=10, coefs=0, error=0, standardized=TRUE){
stuff <- .simulate_data_regression(sample=sample, coefs=coefs, error=error)
if(standardized == TRUE){
y_sd <- sd(data$y)
data <- .simulate_data_regression(sample=sample, coefs=y_sd*coefs, error=error, X=stuff$X)$data
data$y <- as.vector(scale(data$y))
} else{
data <- stuff$data
}
return(data)
}
.simulate_data_regression <- function(sample=10, coefs=0, error=0, X=NULL, y=NULL){
n_var <- length(coefs)
if(is.null(X)){
X <- matrix(0, ncol=n_var, nrow=sample)
for (i in 1:n_var){
X[,i] <- scale(rnorm(sample, 0, 1))
}
}
beta <- as.matrix(coefs)
y <- X %*% beta
if(error != 0){
y <- y + rnorm(sample, 0, error)
}
data = data.frame(X=X)
names(data) <- paste0("V", 1:n_var)
data$y <- as.vector(y)
return(list(X=X, y=y, data=data))
}
Upvotes: 1
Views: 110
Reputation: 76545
If you scale y
the inference will be the same, only the p-values of the intercepts change, not the p-values of the coefficients.
In this example I have set error = 1
.
set.seed(1234) # Make the results reproducible
data <- simulate_data_regression(sample = 50, coefs = c(0.1, 0.8), error = 1)
data2 <- data
data2$y <- scale(data2$y)
fit <- lm(y ~ ., data)
fit2 <- lm(y ~ ., data2)
summary(fit)
summary(fit2)
As you can see the p-values of the coefficients are exactly the same though the coefficients themselves are different. You would expect that since you are scaling by the standard errors of the regressors and therefore the coefficients will be scaled by the inverses of those standard errors.
The version of your function below has an argument, which
, that allows to specify which regressors to scale. Its default is all of them.
simulate_data_regression2 <- function(sample = 10, coefs = 0, error = 0, which = seq_along(coefs)){
n_var <- length(coefs)
X <- matrix(0, ncol=n_var, nrow=sample)
beta <- as.matrix(coefs)
for (i in 1:n_var){
X[,i] <- rnorm(sample, 0, 1)
if(i %in% which) X[, i] <- scale(X[, i])
}
y <- X %*% beta
if(error != 0){
y <- y + rnorm(sample, 0, error)
}
data = data.frame(X=X)
names(data) <- paste0("V", 1:n_var)
data$y <- as.vector(y)
data
}
Now test the function.
set.seed(1234) # Make the results reproducible
data <- simulate_data_regression2(sample=50, coefs=c(0.1, 0.8), error=1)
set.seed(1234) # Reproduce the data generation process
data2 <- simulate_data_regression2(sample=50, coefs=c(0.1, 0.8), error=1, which = 2)
fit <- lm(y ~ ., data)
fit2 <- lm(y ~ ., data2)
As you can see the coefficients of V2
are equal.
coef(fit)
#(Intercept) V1 V2
# 0.01997809 0.19851020 0.96310013
coef(fit2)
#(Intercept) V1 V2
# 0.07040538 0.21130549 0.96310013
The p-values of the estimates of the coefficients V2
are also equal
summary(fit)
summary(fit2)
Upvotes: 2