Shafa Haider
Shafa Haider

Reputation: 431

Simulating datasets in R for model selection

I made a code to simulate a dataset in R to see how backward selection works in machine learning. And I generated poly() function to write polynomial function and then wanted to choose the suitable polynomial using Cp, BIC, adjusted R^2.

The code is:

 ###Generating dataset
 set.seed(1)
 X = rnorm(100)
 eps = rnorm(100)

 beta0 = 3
 beta1 = 2
 beta2 = -3
 beta3 = 0.3
 Y = beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3 + eps
 
library(leaps)
data.full = data.frame(y = Y, x = X)
mod.full = regsubsets(y ~ poly(x, 10, raw = T), data = data.full, nvmax = 10)
mod.summary = summary(mod.full)

### Find the model size for best cp, BIC and adjr2
which.min(mod.summary$cp)

For cp, BIC and adjusted R^2 I get model with polynomial 3 as it should be

However, now I want to simulate 100 datasets and see in how many datasets do I get the right model. I simulated 100 datasets but now I am not getting polynomial 3 for each of the measures. And I don't quite understand what I'm doing wrong. My code for simulation is:

###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))

###Formula (same as before)
 Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
 data.full = data.frame(y = Y, x = data)

 ###Using polynomial terms
 mod.bwd = regsubsets(data.full$y.1 ~ poly(data.full$x.1, 10, raw = T), data = data.full, nvmax = 10, 
method = "backward")
bwd.summary = summary(mod.bwd)

which.min(bwd.summary$cp)
which.min(bwd.summary$bic)
which.max(bwd.summary$adjr2)

For a given subset cp, Bic, adjr2 are giving me different results. For example, using y.1 and x.1 (first dataset in simulation) gives following results:

which.min(bwd.summary$cp): 7

which.min(bwd.summary$bic): 4

which.max(bwd.summary$adjr2): 9

Can someone help me what I'm doing wrong in simulating these 100 datasets.

Upvotes: 0

Views: 215

Answers (1)

snaut
snaut

Reputation: 2535

If I've read your code correctly you run the model on the same simulated dataset 100 times instead of all 100 simulated datasets, this should do the trick:

set.seed(42)

###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))

###Formula (same as before)
 Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
 data.full = data.frame(y = Y, x = data)


res <- lapply(1:100, function(i){
 ###Using polynomial terms
 mod.bwd = regsubsets(data.full[[i]] ~ poly(data.full[[100+i]], 10, raw = T), data = data.full, nvmax = 10, 
method = "backward")
bwd.summary = summary(mod.bwd)

c(
which.min(bwd.summary$cp),
which.min(bwd.summary$bic),
which.max(bwd.summary$adjr2)
)
})

res <- do.call(rbind, res)

With this rng-seed this gives some lines where all cirteria select the correct model.

Upvotes: 1

Related Questions