ZTraveler

Reputation: 1

h2o variable importance for GLM models in R

I fit a GLM using h2o in R (the model is saved as an object called mod). The dataset contains both categorical and continuous predictors. I called h2o.varimp on mod to retrieve the variable importance. For categorical predictors, it seems to report a separate relative importance for each level of the predictor. Is there a way to aggregate the importance across all levels so that I get a single importance metric for each unique predictor?

I first tried summing the relative importances, but since they are based on standardized coefficients, I'm not sure this is the right approach.

Upvotes: 0

Views: 116

Answers (2)

Wendy

Reputation: 301

Another reason that only standardized coefficients are used in the calculation of variable importance is the different scaling of the predictors. Suppose predictor A is in the range 0 to 1, predictor B is in the range 10000 to 200000, and both predictors contribute equally to the final model output.

For example, take A = 0.5, B = 20000 and y = 40000, so that 40000A + 1B = 40000. Each term contributes 20000 to the output, yet the raw coefficients are 40000 and 1. Variable importance without standardization would therefore give the impression that predictor A is much more important than predictor B, which is incorrect. They contributed equally to the model output, so the two should have the same variable importance.
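To make the scaling point concrete, here is a minimal base R sketch (it does not use h2o). The data are simulated and entirely hypothetical: A and B are built so that each explains the same share of the variation in y, and an ordinary lm fit stands in for the GLM.

```r
# Hypothetical data: A in [0, 1], B in [10000, 200000], constructed so that
# both predictors drive y equally once they are put on a common scale.
set.seed(1)
n <- 1000
A <- runif(n, 0, 1)
B <- runif(n, 10000, 200000)
y <- (A - mean(A)) / sd(A) + (B - mean(B)) / sd(B) + rnorm(n, sd = 0.1)

# Fit once on the raw predictors and once on standardized predictors.
raw_fit <- lm(y ~ A + B)
std_fit <- lm(y ~ scale(A) + scale(B))

# Absolute coefficient sizes, intercept dropped.
raw_coef <- abs(coef(raw_fit)[-1])
std_coef <- abs(coef(std_fit)[-1])

print(raw_coef)  # dominated by A purely because of its small range
print(std_coef)  # roughly equal for A and B
```

The raw coefficient on A comes out several orders of magnitude larger than the one on B, while the standardized coefficients are close to each other, which matches the equal contribution built into the simulation.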

Upvotes: 0

Tomáš Frýda

Reputation: 591

I think a weighted average is more appropriate, since a categorical variable takes only one value at a time. I would weight each level by its relative frequency. Imagine you have a very rare level that influences the target variable a lot: with a plain average, this rare level would influence the aggregated importance more than it should.
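A rough sketch of the frequency-weighted aggregation in base R follows. Everything in it is an assumption made for illustration: the vi data frame is a made-up stand-in for as.data.frame(h2o.varimp(mod)), the column names variable and relative_importance and the "predictor.level" naming of the rows are what I would expect from a GLM but should be checked against your own output, and the level frequencies are invented.

```r
# Hypothetical stand-in for as.data.frame(h2o.varimp(mod)); values are made up.
vi <- data.frame(
  variable = c("color.red", "color.green", "color.blue", "age"),
  relative_importance = c(0.80, 0.10, 0.30, 0.50),
  stringsAsFactors = FALSE
)

# Hypothetical relative frequency of each level in the training data
# (for example from prop.table(table(train$color))).
# Numeric predictors have a single row, so they simply get weight 1.
level_freq <- c(color.red = 0.05, color.green = 0.60, color.blue = 0.35, age = 1)

# Map each row back to its source predictor by dropping the ".level" suffix.
# This assumes predictor names themselves contain no dots.
vi$predictor <- sub("\\..*$", "", vi$variable)
vi$weight <- unname(level_freq[vi$variable])

# Frequency-weighted average of the per-level importances, per predictor.
agg <- sapply(split(vi, vi$predictor), function(d) {
  sum(d$relative_importance * d$weight) / sum(d$weight)
})
agg <- sort(agg, decreasing = TRUE)
print(agg)
```

With these made-up numbers, the rare level red (5% of rows) pulls a plain mean for color up to 0.40, whereas the frequency-weighted average is 0.205.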

Another possibility is to use SHAP via predict_contributions and set output_format = "compact". For a GLM, each contribution is essentially the coefficient multiplied by how far the row's value is from the average of that variable in the background_set. These contributions then sum up to the prediction (this might be in link space, so unless you're using Gaussian regression, you might want to set output_space = TRUE). You can take the mean absolute value of the contributions to get a feature importance.
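The idea can be imitated in base R without h2o, which may help to see what the numbers mean. This is only a sketch under stated assumptions: it uses stats::glm on simulated, hypothetical data, treats the training data as the background, and relies on the standard result that for a linear predictor with features treated as independent, the SHAP contribution of column j is beta_j times the deviation of x_j from its background mean. It is not the h2o implementation.

```r
# Simulated, hypothetical data with one categorical and one numeric predictor.
set.seed(42)
n <- 500
train <- data.frame(
  color = factor(sample(c("red", "green", "blue"), n, replace = TRUE,
                        prob = c(0.05, 0.60, 0.35))),
  age = rnorm(n, 40, 10)
)
train$y <- 2 * (train$color == "red") + 0.05 * train$age + rnorm(n)

fit <- glm(y ~ color + age, data = train, family = gaussian())

mm <- model.matrix(fit)
X <- mm[, -1, drop = FALSE]        # design matrix without the intercept
beta <- coef(fit)[-1]
background_mean <- colMeans(X)     # training data used as the background

# Per-column contributions in link space: beta_j * (x_ij - mean_j).
contrib <- sweep(sweep(X, 2, background_mean, "-"), 2, beta, "*")

# Collapse the dummy columns back to their source predictor, similar in
# spirit to a "compact" output with one column per original predictor.
term_id <- attr(mm, "assign")[-1]
term_names <- attr(terms(fit), "term.labels")[term_id]
compact <- t(rowsum(t(contrib), group = term_names))

# Mean absolute contribution per predictor as an importance score.
importance <- sort(colMeans(abs(compact)), decreasing = TRUE)
print(importance)

# Sanity check: contributions plus the mean prediction give the fitted values.
stopifnot(isTRUE(all.equal(unname(rowSums(compact) + mean(fitted(fit))),
                           unname(fitted(fit)))))
```

Because the contributions of all levels of a categorical predictor are summed per row before taking the absolute value, each predictor ends up with a single importance score, which is what the question asks for.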

Upvotes: 0
