Reputation: 599
I would like to perform automated, exhaustive model selection on a dataset with 7 predictors (5 continuous and 2 categorical) in R. I would like all continuous predictors to have the potential for interaction (at least up to 3 way interactions) and also have non-interacting squared terms.
I have been using regsubsets()
from the leaps
package and have gotten good results, however many of the models contain interaction terms without including the main effects as well (e.g., g*h
is an included model predictor but g
is not). Since inclusion of the main effect as well will affect the model score (Cp, BIC, etc) it is important to include them in comparisons with the other models even if they are not strong predictors.
I could manually weed through the results and cross off models that include interactions without main effects but I'd prefer to have an automated way to exclude those. I'm fairly certain this isn't possible with regsubsets()
or leaps()
, and probably not with glmulti
either. Does anyone know of another exhaustive model selection function that allows for such specification or have a suggestion for script that will sort the model output and find only models that fit my specs?
Below is simplified output from my model searches with regsubsets()
. You can see that model 3 and 4 do include interaction terms without including all the related main effects. If no other functions are known for running a search with my specs then suggestions on easily sub-setting this output to exclude models without the necessary main effects included would be helpful.
Model adjR2 BIC CP n_pred X.Intercept. x1 x2 x3 x1.x2 x1.x3 x2.x3 x1.x2.x3
1 0.470344346 -41.26794246 94.82406866 1 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
2 0.437034361 -36.5715963 105.3785057 1 TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
3 0.366989617 -27.54194252 127.5725366 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
4 0.625478214 -64.64414719 46.08686422 2 TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
Upvotes: 1
Views: 3267
Reputation: 599
After working with dredge
I found that my models have too many predictors and interactions to run dredge in a reasonable period (I calculated that with 40+ potential predictors it might take 300k hours to complete the search on my computer). But it does exclude models where interactions don't match with main effects so I imagine that might still be a good solution for many people.
For my needs I've moved back to regsubsets
and have written some code to parse through the search output in order to exclude models that contain terms in interactions that are not included as main effects. This code seems to work well so I'll share it here. Warning: it was written with human expediency in mind, not computational, so it could probably be re-coded to be faster. If you've got 100,000s of models to test you might want to make it sleeker. (I've been working on searches with ~50,000 models and up to 40 factors which take my 2.4ghz i5 core a few hours to process)
reg.output.search.with.test<- function (search_object) { ## input an object from a regsubsets search
## First build a df listing model components and metrics of interest
search_comp<-data.frame(R2=summary(search_object)$rsq,
adjR2=summary(search_object)$adjr2,
BIC=summary(search_object)$bic,
CP=summary(search_object)$cp,
n_predictors=row.names(summary(search_object)$which),
summary(search_object)$which)
## Categorize different types of predictors based on whether '.' is present
predictors<-colnames(search_comp)[(match("X.Intercept.",names(search_comp))+1):dim(search_comp)[2]]
main_pred<-predictors[grep(pattern = ".", x = predictors, invert=T, fixed=T)]
higher_pred<-predictors[grep(pattern = ".", x = predictors, fixed=T)]
## Define a variable that indicates whether model should be reject, set to FALSE for all models initially.
search_comp$reject_model<-FALSE
for(main_eff_n in 1:length(main_pred)){ ## iterate through main effects
## Find column numbers of higher level ters containing the main effect
search_cols<-grep(pattern=main_pred[main_eff_n],x=higher_pred)
## Subset models that are not yet flagged for rejection, only test these
valid_model_subs<-search_comp[search_comp$reject_model==FALSE,]
## Subset dfs with only main or higher level predictor columns
main_pred_df<-valid_model_subs[,colnames(valid_model_subs)%in%main_pred]
higher_pred_df<-valid_model_subs[,colnames(valid_model_subs)%in%higher_pred]
if(length(search_cols)>0){ ## If there are higher level pred, test each one
for(high_eff_n in search_cols){ ## iterate through higher level pred.
## Test if the intxn effect is present without main effect (working with whole column of models)
test_responses<-((main_pred_df[,main_eff_n]==FALSE)&(higher_pred_df[,high_eff_n]==TRUE))
valid_model_subs[test_responses,"reject_model"]<-TRUE ## Set reject to TRUE where appropriate
} ## End high_eff for
## Transfer changes in reject to primary df:
search_comp[row.names(valid_model_subs),"reject_model"]<-valid_model_subs[,"reject_model"
} ## End if
} ## End main_eff for
## Output resulting table of all models named for original search object and current time/date in folder "model_search_reg"
current_time_date<-format(Sys.time(), "%m_%d_%y at %H_%M_%S")
write.table(search_comp,file=paste("./model_search_reg/",paste(current_time_date,deparse(substitute(search_object)),
"regSS_model_search.csv",sep="_"),sep=""),row.names=FALSE, col.names=TRUE, sep=",")
} ## End reg.output.search.with.test fn
Upvotes: 2
Reputation: 13280
You can use the dredge()
function from the MuMIn package.
See also Subsetting in dredge (MuMIn) - must include interaction if main effects are present .
Upvotes: 2