Reputation: 41
I am working with the glmnet package in R and ran into a problem while trying to reproduce an "old" classifier: if the explanatory variables are permuted (say, put in reverse order), the coefficients returned by cv.glmnet are not equal to the coefficients obtained with the unpermuted design matrix.
For example, consider the following data:
library(glmnet)
set.seed(1)
#Set initial parameters
n <- 100
p <- 1000
#Simulate data
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- as.character(1:p)
beta <- rnorm(n = p, mean = 2, sd = 2)
beta[rbinom(p, size = 1, prob = 0.5) == 0] <- 0
y <- x %*% beta + rnorm(n, sd = 0.1)
Then run glmnet with the LASSO penalty (alpha = 1) on both the design matrix x and a column-reversed version of x.
#Set parameters for cross validation with cv.glmnet
lambda <- exp(seq(-1, 1, length.out = 100))
alpha <- 1
foldid <- rep(1:10, each = 10)
#Run cross validation
fit <- cv.glmnet(x = x, y = y, family = "gaussian", alpha = alpha,
                 lambda = lambda, foldid = foldid)
#Save coefficients
coef1 <- as.matrix(coef(fit, s = "lambda.min"))
#Run cross validation with the columns of x in reverse order
perm <- ncol(x):1  #named 'perm' to avoid masking base::order
fit2 <- cv.glmnet(x = x[, perm], y = y, family = "gaussian", alpha = alpha,
                  lambda = lambda, foldid = foldid)
#Save coefficients
coef2 <- as.matrix(coef(fit2, s = "lambda.min"))
coef2 <- coef2[rownames(coef1),]
Then compare the coefficients, the cross validation error, and the linear predictors.
> summary(coef2 - coef1)
1
Min. :-0.2738963
1st Qu.: 0.0000000
Median : 0.0000000
Mean : 0.0003739
3rd Qu.: 0.0000000
Max. : 0.3643040
> min(fit$cvm)
[1] 4584.373
> min(fit2$cvm)
[1] 4596.626
> summary(cbind(1,x) %*% coef2 - cbind(1, x) %*% coef1)
1
Min. :-0.5100
1st Qu.:-0.1613
Median : 0.0210
Mean : 0.0000
3rd Qu.: 0.1333
Max. : 0.6139
All three measures differ between the two fits, even though only the order of the variables has changed. Can anybody explain this?
Upvotes: 4
Views: 1242
Reputation: 21
glmnet computes the LASSO regularization path by coordinate descent (see e.g. slide 15 of this talk by Trevor Hastie: http://web.stanford.edu/~hastie/TALKS/glmnet.pdf). Because the algorithm cycles over the coefficients one at a time, the order of the variables affects the path the optimizer takes, and depending on the convergence threshold and the maximum number of iterations this can lead to differences in the final coefficient values (a toy sketch of this order dependence follows the output below). In the case of your example, try changing
fit <- cv.glmnet(x = x, y = y, family = "gaussian", alpha = alpha,
                 lambda = lambda, foldid = foldid)
to e.g.
fit <- cv.glmnet(x = x, y = y, family = "gaussian", alpha = alpha,
                 lambda = lambda, foldid = foldid,
                 standardize = TRUE, thresh = 1e-20, maxit = 10^6)
and do the same for your fit2. This may take a minute or so to compute, but you will find that the differences become negligibly small:
> summary(coef2 - coef1)
1
Min. :-2.038e-08
1st Qu.: 0.000e+00
Median : 0.000e+00
Mean : 1.050e-10
3rd Qu.: 0.000e+00
Max. : 3.028e-08
> min(fit$cvm)
[1] 4598.242
> min(fit2$cvm)
[1] 4598.242
> summary(cbind(1,x) %*% coef2 - cbind(1, x) %*% coef1)
1
Min. :-5.175e-08
1st Qu.:-1.457e-08
Median :-2.959e-10
Mean : 0.000e+00
3rd Qu.: 1.503e-08
Max. : 5.555e-08
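For intuition, here is a minimal sketch of plain cyclic coordinate descent for the lasso. This is an illustration under simplifying assumptions, not glmnet's actual implementation (which adds standardization, warm starts, and active-set screening); it reuses x and y from the question, and lambda = 1 is just an arbitrary value from within the question's grid. With a loose tolerance the solution keeps an imprint of the cycling order, and tightening tol should make the reversed and original fits agree.
#Illustration only: cyclic coordinate descent for
#  (1/2n) * ||y - x %*% b||^2 + lambda * sum(abs(b))
soft_threshold <- function(z, g) sign(z) * max(abs(z) - g, 0)
lasso_cd <- function(x, y, lambda, tol = 1e-4, maxit = 1000) {
  n <- nrow(x)
  b <- rep(0, ncol(x))
  r <- drop(y)                 #residual for the current coefficients
  for (it in seq_len(maxit)) {
    b_old <- b
    for (j in seq_along(b)) {  #cycle over coordinates in column order
      r <- r + x[, j] * b[j]   #remove variable j's contribution
      z <- sum(x[, j] * r) / n
      b[j] <- soft_threshold(z, lambda) / (sum(x[, j]^2) / n)
      r <- r - x[, j] * b[j]   #put the updated contribution back
    }
    if (max(abs(b - b_old)) < tol) break  #loose tol: order leaves a mark
  }
  b
}
#Reversed columns, coefficients reversed back for comparison; with the
#loose default tol the difference is typically nonzero, with e.g.
#tol = 1e-12 it should vanish to numerical precision.
b1 <- lasso_cd(x, y, lambda = 1)
b2 <- rev(lasso_cd(x[, ncol(x):1], y, lambda = 1))
max(abs(b1 - b2))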
Upvotes: 2
Reputation: 126
I believe it is because glmnet uses coordinate descent: the variables are iterated over one at a time to minimize the loss function, so reordering the variables changes the order of the updates, which changes the path taken to minimize the loss.
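As a quick check (a sketch reusing x, y, and lambda from the question), the order sensitivity can be isolated from cross-validation by fitting glmnet directly; with the default convergence threshold the coefficients typically differ by a small but nonzero amount:
#Fit the same lasso path with and without reversing the columns; any
#difference here is due purely to the coordinate-update order.
g1 <- glmnet(x, y, family = "gaussian", alpha = 1, lambda = lambda)
g2 <- glmnet(x[, ncol(x):1], y, family = "gaussian", alpha = 1, lambda = lambda)
c1 <- as.matrix(coef(g1))
c2 <- as.matrix(coef(g2))[rownames(c1), ]  #realign the rows
max(abs(c1 - c2))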
Upvotes: 1