Reputation: 431
I made a code to simulate a dataset in R to see how backward selection works in machine learning. And I generated poly() function to write polynomial function and then wanted to choose the suitable polynomial using Cp, BIC, adjusted R^2.
The code is:
###Generating dataset
set.seed(1)
X = rnorm(100)
eps = rnorm(100)
beta0 = 3
beta1 = 2
beta2 = -3
beta3 = 0.3
Y = beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3 + eps
library(leaps)
data.full = data.frame(y = Y, x = X)
mod.full = regsubsets(y ~ poly(x, 10, raw = T), data = data.full, nvmax = 10)
mod.summary = summary(mod.full)
### Find the model size for best cp, BIC and adjr2
which.min(mod.summary$cp)
For cp, BIC and adjusted R^2 I get model with polynomial 3 as it should be
However, now I want to simulate 100 datasets and see in how many datasets do I get the right model. I simulated 100 datasets but now I am not getting polynomial 3 for each of the measures. And I don't quite understand what I'm doing wrong. My code for simulation is:
###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))
###Formula (same as before)
Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
data.full = data.frame(y = Y, x = data)
###Using polynomial terms
mod.bwd = regsubsets(data.full$y.1 ~ poly(data.full$x.1, 10, raw = T), data = data.full, nvmax = 10,
method = "backward")
bwd.summary = summary(mod.bwd)
which.min(bwd.summary$cp)
which.min(bwd.summary$bic)
which.max(bwd.summary$adjr2)
For a given subset cp, Bic, adjr2 are giving me different results. For example, using y.1 and x.1 (first dataset in simulation) gives following results:
which.min(bwd.summary$cp): 7
which.min(bwd.summary$bic): 4
which.max(bwd.summary$adjr2): 9
Can someone help me what I'm doing wrong in simulating these 100 datasets.
Upvotes: 0
Views: 215
Reputation: 2535
If I've read your code correctly you run the model on the same simulated dataset 100 times instead of all 100 simulated datasets, this should do the trick:
set.seed(42)
###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))
###Formula (same as before)
Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
data.full = data.frame(y = Y, x = data)
res <- lapply(1:100, function(i){
###Using polynomial terms
mod.bwd = regsubsets(data.full[[i]] ~ poly(data.full[[100+i]], 10, raw = T), data = data.full, nvmax = 10,
method = "backward")
bwd.summary = summary(mod.bwd)
c(
which.min(bwd.summary$cp),
which.min(bwd.summary$bic),
which.max(bwd.summary$adjr2)
)
})
res <- do.call(rbind, res)
With this rng-seed this gives some lines where all cirteria select the correct model.
Upvotes: 1