Reputation: 207
I would like to compare variable importance across models (multiple regression, LASSO, Ridge, GBM), but I'm not sure the procedure is correct, because the values obtained are not on the same scale.
For multiple regression and GBM, varImp from the caret package returns values from 0 to 100. The statistic is calculated differently for each method:
Linear Models: the absolute value of the t-statistic for each model parameter is used.
Boosted Trees: this method uses the same approach as a single tree, but sums the importance of each boosting iteration.
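For illustration, the linear-model rule above can be reproduced in base R. This is only a sketch: the mtcars data and the chosen predictors are arbitrary, and the final min-max step assumes the 0 to 100 values are the raw statistic rescaled that way.

```r
# Sketch: "absolute t-statistic" importance for a linear model, base R only.
# The data set and predictors are illustrative, not from the question.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))

# Drop the intercept and take |t| as the raw importance
keep <- rownames(coef_table) != "(Intercept)"
raw_importance <- abs(coef_table[keep, "t value"])

# Assumed min-max rescaling onto 0-100
scaled_importance <- (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance)) * 100

print(round(sort(scaled_importance, decreasing = TRUE), 1))
```

The raw |t| values and the rescaled values rank the predictors identically; only the units change.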
For LASSO and Ridge, the values range from 0.00 to 0.99 and are calculated with this function:
varImp <- function(object, lambda = NULL, ...) {
  # Coefficients at the chosen penalty; may be a list (one matrix per class)
  beta <- predict(object, s = lambda, type = "coef")
  if (is.list(beta)) {
    out <- do.call("cbind", lapply(beta, function(x) x[, 1]))
    out <- as.data.frame(out)
  } else {
    out <- data.frame(Overall = beta[, 1])
  }
  # Importance = absolute coefficient, with the intercept excluded
  out <- abs(out[rownames(out) != "(Intercept)", , drop = FALSE])
  out
}
I took this function from the question: Caret package - glmnet variable importance
I followed other questions on the forum, but could not work out why the scales differ. How can I make these measurements comparable?
Upvotes: 0
Views: 343
Reputation: 4993
If the goal is simply to compare them side by side, then what matters is putting them all on one common scale and sorting them.
You can do this by defining a standardized scale, in this case 0 to 100, and rescaling each model's varImp values onto it.
importance_data <- c(-23, 12, 32, 18, 45, 1, 77, 18, 22)
new_scale <- function(x) {
  # Min-max rescale: min(x) maps to 0, max(x) maps to 100
  y <- (100 - 0) / (max(x) - min(x)) * (x - max(x)) + 100
  sort(y)
}
new_scale(importance_data)
# results
# [1]   0  24  35  41  41  45  55  68 100
This gives you a uniform scale. It does not mean that a 22 from one model is equivalent to a 22 from another, but for relative comparison any common scale will do.
It gives you a standardized sense of how far apart the variables are in importance within each model, so you can compare the models side by side using the relative, scaled importances.
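As a rough sketch of the side-by-side comparison, the same rescaling can be applied to one named vector per model. The importance values and the names `importances`, `rescale_0_100`, `scaled`, and `ranks` are invented for illustration. The sketch assumes every vector lists the same variables in the same order.

```r
# Hypothetical raw importances from three models (values made up)
importances <- list(
  lm    = c(x1 = 4.2,  x2 = 1.1,  x3 = 0.3),   # e.g. |t| statistics
  lasso = c(x1 = 0.85, x2 = 0.00, x3 = 0.12),  # e.g. |coefficients|
  gbm   = c(x1 = 61,   x2 = 30,   x3 = 9)      # e.g. summed tree gains
)

# Min-max rescale a named vector onto 0-100, keeping the names
rescale_0_100 <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) {
    # All values equal: no spread to rescale, so return zeros
    return(setNames(rep(0, length(x)), names(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1]) * 100
}

# Matrix with one row per variable and one column per model
scaled <- sapply(importances, rescale_0_100)

# Rank within each model (1 = most important) for a scale-free comparison
ranks <- apply(-scaled, 2, rank)

print(round(scaled, 1))
print(ranks)
```

Unlike `new_scale`, this version does not sort inside the function, so each row stays tied to its variable name and the models line up column by column.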
Upvotes: 1