Dominique Makowski
Dominique Makowski

Reputation: 1673

Simulate data for regression with standardized Y

Based on this topic, I have created a function that returns a dataset with variables related to the outcome (y) by specific linear coefs.

simulate_data_regression <- function(sample=10, coefs=0, error=0){

  n_var <- length(coefs)
  X <- matrix(0, ncol=n_var, nrow=sample)

  beta <- as.matrix(coefs)

  for (i in 1:n_var){
    X[,i] <- scale(rnorm(sample, 0, 1))
  }

  y <- X %*% beta

  if(error != 0){
    y <- y + rnorm(sample, 0, error)
  }


  data = data.frame(X=X)
  names(data) <- paste0("V", 1:n_var)
  data$y <- as.vector(y)

  return(data)
}

data <- simulate_data_regression(sample=50, coefs=c(0.1, 0.8), error=0)
summary(data)
sd(data$V1)
sd(data$y)

It works great. However, I would need to have a standardized y (mean 0 and SD 1). But when I try to scale it, the coefficients change:

data <- simulate_data_regression(sample=50, coefs=c(0.1, 0.8), error=0)
data$y <- as.vector(scale(data$y))
coef(lm(y ~ ., data=data))

It is possible to do such thing? Thank you very much!


Edit

In other words, I would like the coefs that are specified to be standardized coefs (expressed in outcome's SD).

Scaling y a posteriori changes the coefs by 1/sd(y). However, I can't think of any way to change the betas before generating y, so that the betas return to their specified value after the scaling of y.


Edit 2: Failed attempt

I've tried running the function twice, first extracting sd(y) and scaling the coefficients with it, in the hope that those scaled coefficients will change to the specified ones once I'll scale y. But it doens't work, which is expected, as sd(y) changes when I change the coefs :'(

Here's the failed attempt:

simulate_data_regression <- function(sample=10, coefs=0, error=0, standardized=TRUE){

  stuff <- .simulate_data_regression(sample=sample, coefs=coefs, error=error)
  if(standardized == TRUE){
    y_sd <- sd(data$y)
    data <- .simulate_data_regression(sample=sample, coefs=y_sd*coefs, error=error, X=stuff$X)$data
    data$y <- as.vector(scale(data$y))
  } else{
    data <- stuff$data
  }
  return(data)
}


.simulate_data_regression <- function(sample=10, coefs=0, error=0, X=NULL, y=NULL){

  n_var <- length(coefs)

  if(is.null(X)){
    X <- matrix(0, ncol=n_var, nrow=sample)
    for (i in 1:n_var){
      X[,i] <- scale(rnorm(sample, 0, 1))
    }
  }

  beta <- as.matrix(coefs)
  y <- X %*% beta

  if(error != 0){
    y <- y + rnorm(sample, 0, error)
  }


  data = data.frame(X=X)
  names(data) <- paste0("V", 1:n_var)
  data$y <- as.vector(y)

  return(list(X=X, y=y, data=data))
}

Upvotes: 1

Views: 110

Answers (1)

Rui Barradas
Rui Barradas

Reputation: 76545

If you scale y the inference will be the same, only the p-values of the intercepts change, not the p-values of the coefficients.
In this example I have set error = 1.

set.seed(1234)    # Make the results reproducible
data <- simulate_data_regression(sample = 50, coefs = c(0.1, 0.8), error = 1)
data2 <- data
data2$y <- scale(data2$y)

fit <- lm(y ~ ., data)
fit2 <- lm(y ~ ., data2)

summary(fit)
summary(fit2)

As you can see the p-values of the coefficients are exactly the same though the coefficients themselves are different. You would expect that since you are scaling by the standard errors of the regressors and therefore the coefficients will be scaled by the inverses of those standard errors.

The version of your function below has an argument, which, that allows to specify which regressors to scale. Its default is all of them.

simulate_data_regression2 <- function(sample = 10, coefs = 0, error = 0, which = seq_along(coefs)){
  n_var <- length(coefs)
  X <- matrix(0, ncol=n_var, nrow=sample)
  beta <- as.matrix(coefs)
  for (i in 1:n_var){
    X[,i] <- rnorm(sample, 0, 1)
    if(i %in% which) X[, i] <- scale(X[, i])
  }
  y <- X %*% beta
  if(error != 0){
    y <- y + rnorm(sample, 0, error)
  }
  data = data.frame(X=X)
  names(data) <- paste0("V", 1:n_var)
  data$y <- as.vector(y)
  data
}

Now test the function.

set.seed(1234)    # Make the results reproducible
data <- simulate_data_regression2(sample=50, coefs=c(0.1, 0.8), error=1)

set.seed(1234)    # Reproduce the data generation process
data2 <- simulate_data_regression2(sample=50, coefs=c(0.1, 0.8), error=1, which = 2)

fit <- lm(y ~ ., data)
fit2 <- lm(y ~ ., data2)

As you can see the coefficients of V2 are equal.

coef(fit)
#(Intercept)          V1          V2 
# 0.01997809  0.19851020  0.96310013

coef(fit2)
#(Intercept)          V1          V2 
# 0.07040538  0.21130549  0.96310013

The p-values of the estimates of the coefficients V2 are also equal

summary(fit)
summary(fit2)

Upvotes: 2

Related Questions