Reputation: 504
To help learn about machine learning, I am coding up some examples, not to show that one method is better than another, but to illustrate how to use various functions and what parameters to tune. I started with this blog that compared BooST and xgboost, then I successfully added gbm to the example. Now I am trying to add glmnet, but the returned model always has (near-)zero values for both coefficients. Either I am doing something wrong, or glmnet is just not the correct algorithm for this data, and I am trying to figure out which it is. Here is my reproducible example:
# Uncomment the following 2 lines if you need to install BooST (requires devtools)
#library(devtools)
#install_github("gabrielrvsc/BooST")
library(BooST)
library(xgboost)
library(gbm)
library(glmnet)
library(ggplot2)  # used for the plots below
# Data-generating process: y = cos(pi*(x1 + x2)) plus Gaussian noise, with the
# noise variance chosen so the signal explains a fraction r2 of the total variance
dgp = function(N, r2){
  X = matrix(rnorm(N*2, 0, 1), N, 2)
  X[, ncol(X)] = base::sample(c(0, 1), N, replace = TRUE)  # second predictor is binary
  yaux = cos(pi*(rowSums(X)))      # noiseless signal
  vyaux = var(yaux)
  ve = vyaux*(1 - r2)/r2           # implied noise variance
  e = rnorm(N, 0, sqrt(ve))
  y = yaux + e
  return(list(y = y, X = X))
}
# True (noise-free) function, evaluated on a grid for plotting
x1r = rep(seq(-4,4,length.out = 1000), 2)
x2r = c(rep(0,1000), rep(1,1000))
yr = cos(pi*(x1r+x2r))
real_function = data.frame(x1 = x1r, x2 = as.factor(x2r), y = yr)
# Train data (noisy)
set.seed(1)
data = dgp(N = 1000, r2 = 0.5)
y = data$y
x = data$X
# Test data (noisy)
set.seed(2)
dataout=dgp(N = 1000, r2 = 0.5)
yout = dataout$y
xout = dataout$X
# Set seed and train all 4 models
set.seed(1)
BooST_Model = BooST(x, y, v = 0.18, M = 300 , display = TRUE)
xgboost_Model = xgboost(x, label = y, nrounds = 300, params = list(eta = 0.14, max_depth = 2))
gbm_Model = gbm.fit(x, y, distribution = "gaussian", n.trees = 10000, shrinkage = .001, interaction.depth=5)
glmnet_Model = cv.glmnet(x, y, family = "gaussian", alpha=0)
coef(glmnet_Model)
3 x 1 sparse Matrix of class "dgCMatrix"
                                                           1
(Intercept)  0.078072154632597062784427066617354284971952438
V1          -0.000000000000000000000000000000000000003033534
V2          -0.000000000000000000000000000000000000044661342
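As an additional check (added here for illustration), note that coef() on a cv.glmnet object defaults to s = "lambda.1se", so the coefficients at the lambda that minimizes the cross-validation error can be inspected as well:
# Coefficients at lambda.min instead of the default lambda.1se
coef(glmnet_Model, s = "lambda.min")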
# Predict from test data
p_BooST = predict(BooST_Model, xout)
p_xgboost = predict(xgboost_Model, xout)
p_gbm = predict(gbm_Model, xout, n.trees=10000)
p_glmnet = predict(glmnet_Model, xout)
# Show RMSE
sqrt(mean((p_BooST - yout)^2))
sqrt(mean((p_xgboost - yout)^2))
sqrt(mean((p_gbm - yout)^2))
sqrt(mean((p_glmnet - yout)^2))
fitted = data.frame(x1 = x[,1], x2 = as.factor(x[,2]),
BooST = fitted(BooST_Model),
xgboost = predict(xgboost_Model, x),
gbm = predict(object = gbm_Model, newdata = x, n.trees = 10000),
glmnet = predict(glmnet_Model, newx = x, s=glmnet_Model$lambda.min)[, 1], y = y)
# Plot noisy Y
ggplot() + geom_point(data = fitted, aes(x = x1, y = y, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))
# Plot xgboost
ggplot() + geom_point(data = fitted, aes(x = x1, y = y), color = "gray") + geom_point(data = fitted, aes(x = x1, y = xgboost, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))
# Plot BooST
ggplot() + geom_point(data = fitted, aes(x = x1, y = y), color = "gray") + geom_point(data = fitted, aes(x = x1, y = BooST, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))
# Plot gbm
ggplot() + geom_point(data = fitted, aes(x = x1, y = y), color = "gray") + geom_point(data = fitted, aes(x = x1, y = gbm, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))
# Plot glmnet
ggplot() + geom_point(data = fitted, aes(x = x1, y = y), color = "gray") + geom_point(data = fitted, aes(x = x1, y = glmnet, color = x2)) + geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))
Upvotes: 2
Views: 472
Reputation: 60319
Either I am doing something wrong
You are not, at least programming-wise.
or glmnet is just not the correct algo for this data
It is not that glmnet is "not correct" (although it is meant primarily for problems with many predictors, not just a couple); it is that your comparison is fundamentally unfair and not appropriate: the other 3 algorithms you use are all ensemble methods - your gbm, for instance, consists of ten thousand (10,000) individual decision trees! Comparing that with a single regressor such as glmnet is almost like comparing apples to oranges.
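A quick, illustrative way to see that gap in model complexity (this snippet is an addition, and assumes the fitted objects from your code are still in scope):
nrow(coef(glmnet_Model))   # glmnet: just an intercept and two slopes
gbm_Model$n.trees          # gbm: 10,000 boosted trees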
Nevertheless, this should be a good exercise and a reminder that the apparent "equivalence" of all these tools from a programming perspective ("well, I just use library() to load each of them, right? So why shouldn't they be comparable?") hides a lot... That is why at least an elementary familiarity with the principles of statistical learning is always a good idea (for starters, I strongly suggest the freely available Introduction to Statistical Learning, which also includes R code snippets).
Boosting in particular (the unifying idea behind the other 3 algorithms you are using here) is no joke! It was a real game changer when it came out, well before the era of deep learning, and, in its xgboost implementation, it is still a winning choice in many Kaggle competitions involving "traditional" structured data (i.e. no text or images).
Upvotes: 1
Reputation: 57686
Remember that glmnet fits a linear model, meaning the response can be written as a linear combination of the predictors:
y = b0 + b1*x1 + b2*x2 + ...
In your dataset, you define the response as, variously,
yaux = cos(pi*(rowSums(X)))
yr = cos(pi*(x1r+x2r))
which in both cases is clearly not a linear combination of the predictors.
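A minimal sketch of both points, reusing x and y from the question (the added cos(pi*(x1+x2)) column is an illustrative basis expansion, not something glmnet constructs on its own):
# An ordinary linear model should find similarly tiny slopes, so the flat
# glmnet fit reflects the data, not a problem with the package
summary(lm(y ~ x))
# If the nonlinear term is supplied explicitly as a feature, glmnet can use it
x_expanded = cbind(x, cospx = cos(pi * rowSums(x)))
coef(cv.glmnet(x_expanded, y, alpha = 0), s = "lambda.min")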
Upvotes: 1