Reputation:
I am fitting linear models in R. My predictors include birth rates, death rates, infant mortality rates, life expectancies, and region. The region variable has 7 levels, with a numeric code representing each region:
I ran a lasso regression in R to try to improve on the generalized linear model. The lasso regression coefficients are as follows:
I will put the factors selected by Lasso Regression into the lm function in R:
Lasso.lm <- lm(log(GNIpercapita) ~ deathrate + infantdeaths + life.exp.avg +
               life.exp.diff + region, data = econdev)
However, how do I add individual regions to the lm model? For example, for regionEast Asia & Pacific, I can't just add + regionEast Asia & Pacific.
Upvotes: 2
Views: 641
Reputation: 107
I agree with the previous comments that picking and choosing individual levels of a categorical variable is not recommended. If you still want to do it, the modeldb package makes it easy to create dummy variables for each level of your categorical variable. Remember that in lm() you have to leave one level of the categorical variable out to avoid perfect collinearity.
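library(dplyr)    # provides the %>% pipe
library(modeldb)
df %>%
  # auto_values = TRUE asks modeldb to look up the levels of region itself;
  # alternatively, pass them explicitly via values = c(...)
  add_dummy_variables(region, auto_values = TRUE)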
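If you would rather not add a package, the same idea can be sketched in base R with model.matrix(). This is only an illustration under assumptions: the data frame, its column names, and the values below are simulated stand-ins, not the asker's econdev data.

```r
# Simulated stand-in data (hypothetical; not the econdev data set).
set.seed(1)
regions <- c("East Asia & Pacific", "Europe & Central Asia", "South Asia")
df <- data.frame(
  region    = factor(rep(regions, each = 20)),
  deathrate = runif(60, 2, 15)
)
df$y <- 8 - 0.1 * df$deathrate + rnorm(60)

# model.matrix() expands the factor into 0/1 columns. Dropping the intercept
# column leaves one dummy per non-reference level (the first level is the
# reference, which avoids perfect collinearity).
dummies <- model.matrix(~ region, data = df)[, -1, drop = FALSE]
colnames(dummies) <- make.names(colnames(dummies))  # make names formula-safe
df2 <- cbind(df, dummies)

# Keep only one dummy; every other region is absorbed into the baseline.
keep <- colnames(dummies)[1]
fit  <- lm(reformulate(c("deathrate", keep), response = "y"), data = df2)
summary(fit)
```

Building the formula with reformulate() sidesteps the problem of typing a level name that contains spaces or an ampersand.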
Upvotes: 0
Reputation: 4993
You cannot use pieces and parts of the category.
You can eliminate numerical variables, or entire columns of categorical variables, but you cannot pick and choose individual categories because it fragments your dataframe.
You might be better off using the fitted lasso regression itself and predicting from it. It is no less of a regression because of the regularization. It is more complex, more robust, and less straightforward, but not 'worse'.
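As a rough sketch of predicting directly from the lasso, assuming the glmnet package was used (the question does not say which package produced the fit), something like the following should work. The data are simulated and the variable names are placeholders.

```r
library(glmnet)

# Simulated stand-in data (hypothetical; not the econdev data set).
set.seed(42)
n <- 120
sim <- data.frame(
  deathrate    = runif(n, 2, 15),
  infantdeaths = runif(n, 2, 90),
  region       = factor(rep(paste0("R", 1:4), each = n / 4))
)
sim$logGNI <- 10 - 0.05 * sim$deathrate - 0.03 * sim$infantdeaths +
  rnorm(n, sd = 0.5)

# glmnet needs a numeric matrix; model.matrix() dummy-codes region for us.
x <- model.matrix(logGNI ~ ., data = sim)[, -1]
y <- sim$logGNI

# alpha = 1 selects the lasso penalty; lambda is chosen by cross-validation.
cv_fit <- cv.glmnet(x, y, alpha = 1)

coef(cv_fit, s = "lambda.min")                       # shrunken coefficients
pred <- predict(cv_fit, newx = x, s = "lambda.min")  # predictions from the lasso
```

For new observations, build newx with the same model.matrix() call so the dummy columns line up with those used in fitting.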
If that does not work for you, then you can run an lm() with the selected continuous variables and the entire region variable, and accept that the model is imperfect, as all models are. Alternatively, remove region altogether and settle for what may be a less predictive model.
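A minimal sketch of those two options, on simulated data with placeholder names, might look like this. Since the question says region is stored as numeric codes, it is wrapped in factor() here so that lm() treats it as categorical rather than as a continuous number.

```r
# Simulated stand-in data (hypothetical; not the econdev data set).
set.seed(7)
n <- 140
sim <- data.frame(
  region    = rep(1:7, each = n / 7),  # numeric region codes, 7 levels
  deathrate = runif(n, 2, 15)
)
region_effect <- c(0, 0.5, -0.3, 1, 0.2, -0.8, 0.4)
sim$logGNI <- 9 - 0.1 * sim$deathrate + region_effect[sim$region] +
  rnorm(n, sd = 0.5)

# Option 1: keep the whole region variable (6 dummies plus a baseline level).
with_region <- lm(logGNI ~ deathrate + factor(region), data = sim)

# Option 2: drop region entirely.
without_region <- lm(logGNI ~ deathrate, data = sim)

# F-test comparing the nested models: does region as a whole add anything?
anova(without_region, with_region)
```

The anova() comparison tests region as a block, which is in line with treating the categorical variable as all-or-nothing.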
Upvotes: 0