Roger J Bos CFA
Roger J Bos CFA

Reputation: 504

glmnet model performance compared to boosting algorithms

To help learn about machine learning I am coding up some examples, not to show that one method is better than another, but to illustrate how to use various functions and what parameters to tune. I started with this blog that compared BooST and xgboost, then I successfully added gbm to the example. Now I am trying to add glmnet, but the model returned alwyas has (near) zero for both of the coefficients. Either I am doing something wrong, or glmnet is just not the correct algo for this data. I am trying to figure out which it is. Here is my reproducible example:

# Uncomment the following 2 lines if you need to install BooST (requires devtools)
#library(devtools)
#install_github("gabrielrvsc/BooST")

library(BooST)
library(xgboost)
library(gbm)
library(glmnet)

# Data generating process
dgp = function(N, r2){
  X = matrix(rnorm(N*2,0,1),N,2)
  X[,ncol(X)] = base::sample(c(0,1),N,replace=TRUE)
  aux = X
  yaux = cos(pi*(rowSums(X)))
  vyaux = var(yaux)
  ve = vyaux*(1-r2)/r2
  e = rnorm(N,0,sqrt(ve))
  y = yaux+e
  return(list(y = y, X = X))
}

# Real data
x1r = rep(seq(-4,4,length.out = 1000), 2)
x2r = c(rep(0,1000), rep(1,1000))
yr = cos(pi*(x1r+x2r))
real_function = data.frame(x1 = x1r, x2 = as.factor(x2r), y = yr)

# Train data (noisy)
set.seed(1)
data = dgp(N = 1000, r2 = 0.5)
y = data$y
x = data$X

# Test data (noisy)
set.seed(2)
dataout=dgp(N = 1000, r2 = 0.5)
yout = dataout$y
xout = dataout$X

# Set seed and train all 4 models
set.seed(1)
BooST_Model = BooST(x, y, v = 0.18, M = 300 , display = TRUE)
xgboost_Model = xgboost(x, label = y, nrounds = 300, params = list(eta = 0.14, max_depth = 2))
gbm_Model = gbm.fit(x, y, distribution = "gaussian", n.trees = 10000, shrinkage = .001, interaction.depth=5)
glmnet_Model = cv.glmnet(x, y, family = "gaussian", alpha=0)
coef(glmnet_Model)

coef(glmnet_Model)

3 x 1 sparse Matrix of class "dgCMatrix" 1

(Intercept) 0.078072154632597062784427066617354284971952438

V1 -0.000000000000000000000000000000000000003033534

V2 -0.000000000000000000000000000000000000044661342

# Predict from test data
p_BooST = predict(BooST_Model, xout)
p_xgboost = predict(xgboost_Model, xout)
p_gbm = predict(gbm_Model, xout, n.trees=10000)
p_glmnet = predict(glmnet_Model, xout)

# Show RMSE
sqrt(mean((p_BooST - yout)^2))
sqrt(mean((p_xgboost - yout)^2))
sqrt(mean((p_gbm - yout)^2))
sqrt(mean((p_glmnet - yout)^2))

fitted = data.frame(x1 = x[,1], x2 = as.factor(x[,2]),
  BooST = fitted(BooST_Model),
  xgboost = predict(xgboost_Model, x),
  gbm = predict(object = gbm_Model, newdata = x, n.trees = 10000),
  glmnet = predict(glmnet_Model, newx = x, s=glmnet_Model$lambda.min)[, 1], y = y)

# Plot noisy Y
ggplot() + geom_point(data = fitted, aes(x = x1, y = y, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))

# Plot xgboost
ggplot() + geom_point(data = fitted, aes(x = x1, y = y), color = "gray") + geom_point(data = fitted, aes(x = x1, y = xgboost, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))

# Plot BooST
ggplot() + geom_point(data = fitted, aes(x = x1, y = y), color = "gray") + geom_point(data = fitted, aes(x = x1, y = BooST, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))

# Plot gbm
ggplot() + geom_point(data = fitted, aes(x = x1, y = y), color = "gray") + geom_point(data = fitted, aes(x = x1, y = gbm, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))

# Plot glmnet
ggplot() + geom_point(data = fitted, aes(x = x1, y = y), color = "gray") + geom_point(data = fitted, aes(x = x1, y = glmnet, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))

Upvotes: 2

Views: 472

Answers (2)

desertnaut
desertnaut

Reputation: 60319

Either I am doing something wrong

You are not, at least programming-wise

or glmnet is just not the correct algo for this data

It is not that glmnet is "not correct" (although it is supposed to be primarily used for problems with a lot of predictors, not just a couple); it is that your comparison is fundamentally "unfair" and not appropriate: all the other 3 algorithms you use are ensemble ones - your gbm, for instance, consists of ten thousand (10,000) individual decision trees...! Attempting to compare this with a single regressor, such as glmnet, well, it is almost like comparing apples to oranges...

Nevertheless, this should be a good exercise & reminder that, the fact that all these tools, from a programming perspective, seem "equivalent" ("well, I just use library() to load each of them, right? So, why they shouldn't be kind of equivalent and comparable?"), hides a lot... And that's why at least an elementary familiarity with the principles of statistical learning is always a good idea (I strongly suggest the freely available Introduction to Statistical Learning, for starters - including also R code snippets).

The ensemble method of Adaboost, in particular (which is the unifying element behind all the other 3 algorithms you are using here), is no joke! It was a real game changer when it came out (roughly 10 years before the era of deep learning), and, in the xgboost implementation, it is still the winning choice in most Kaggle competitions that involve "traditional" structured data (i.e. no text or images).

Upvotes: 1

Hong Ooi
Hong Ooi

Reputation: 57686

Remember that glmnet fits a linear model, meaning the response can be written as a linear combination of the predictors:

 y = b0 + b1*x1 + b2*x2 + ...

In your dataset, you define the response as, variously

yaux = cos(pi*(rowSums(X)))

yr = cos(pi*(x1r+x2r))

which in both cases is clearly not a linear combination of the predictors.

Upvotes: 1

Related Questions