How do I detect the presence of a pattern in a string, for a fixed set of patterns?

Question

In the context of variable selection, I'm trying to count the number of times each variable is selected over bootstrapped iterations. A simple version of the problem is provided below, along with my current solution (stored in answer), but that solution quickly becomes unwieldy when dealing with 50 or 100 variables.

I have the set of variable names I would like to count over (pred) so I thought it should be possible to create new columns based on those values and then detect the relevant string for each. But I can't figure out how without manually setting the column names and pasting the function. There must be a better way...

Any other solutions would be welcome, including tidyverse or purrr...

library(dplyr)

df <- mtcars
n <- nrow(df)
pred <- colnames(df)[2:length(df)]
target <- "mpg"
mpg_formula <- paste(target, "~", paste(pred, collapse = "+"))

steplm <- data.frame()

bootnum <- 10

for (i in 1:bootnum) {
  message("Fitting model ", i, " out of ", bootnum)
  data.id <- sample(seq_len(n), replace = TRUE)
  fit.lms <- step(lm(mpg_formula, data=df[data.id, ]), 
                  direction = "backward",
                  trace = 0)
  selected.vars <- paste(sort(names(coef(fit.lms)[-1])), collapse = ", ")
  step.result <- data.frame("model" = selected.vars,
                            "nvar" = length(names(coef(fit.lms)[-1])))
  steplm <- dplyr::bind_rows(steplm, step.result)
}

steplm %>%
  transmute(
      cyl = grepl(pattern = "cyl",  x = model),
     disp = grepl(pattern = "disp", x = model),
       hp = grepl(pattern = "hp",   x = model),
     drat = grepl(pattern = "drat", x = model),
       wt = grepl(pattern = "wt",   x = model),
     qsec = grepl(pattern = "qsec", x = model),
       vs = grepl(pattern = "vs",   x = model),
       am = grepl(pattern = "am",   x = model),
     gear = grepl(pattern = "gear", x = model),
     carb = grepl(pattern = "carb", x = model)
  ) -> answer

This produces the following data.frame (or matrix). I can sum its columns to get the counts I want, or use matrix operations to get pairwise and joint dependencies between terms, so the result needs to be in this matrix format for the next step.

     cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
   FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE
    TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
    TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
    TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
    TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
   FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE
   FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
    TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
    TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
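
For reference, the column sums and pairwise operations mentioned above could look roughly like the sketch below. The small sel matrix is made-up illustration data (not output from the bootstrap loop), and sel_counts and co_sel are hypothetical names.

```r
# Toy selection matrix: rows are bootstrap iterations, columns are variables.
# The values are invented for illustration, not taken from a real run.
sel <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, FALSE,
    FALSE, TRUE,  TRUE,
    TRUE,  TRUE,  FALSE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("cyl", "disp", "hp"))
)

# Number of iterations in which each variable was selected
sel_counts <- colSums(sel)

# Pairwise co-selection counts: entry [i, j] is the number of iterations
# in which variables i and j were both selected.
# Multiplying by 1 converts TRUE/FALSE to 1/0 while keeping the column names.
co_sel <- crossprod(sel * 1)
```

The diagonal of co_sel repeats the per-variable counts, so a single crossprod call gives both the marginal and the pairwise tallies.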

Ronak Shah · Accepted Answer

You can use sapply, which calls grepl once per variable name in pred and uses those names as the column names:

sapply(pred, grepl, steplm$model)

#        cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
# [1,] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
# [2,]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
# [3,] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
# [4,] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
# [5,] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
# [6,]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE
# [7,] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
# [8,] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
# [9,] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
#[10,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE

sapply returns a logical matrix here. Wrap the sapply output in data.frame() if you need a data frame.

identical(data.frame(sapply(pred, grepl, steplm$model)), answer)
#[1] TRUE
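
One caveat: grepl matches substrings, so this works for the mtcars names only because none of them is contained in another. If your 50 or 100 variable names overlap, a possible workaround (a minimal sketch, assuming the model strings are joined with ", " as in the question) is to split each string back into names and test exact membership. The pred and model values below are hypothetical examples, not taken from the question.

```r
# Hypothetical variable names where one is a substring of another
pred  <- c("am", "gamma", "wt")
model <- c("gamma, wt", "am", "am, gamma")

# Substring matching: "am" is wrongly flagged in row 1 because of "gamma"
loose <- sapply(pred, grepl, model, fixed = TRUE)

# Exact matching: split each model string into its variable names,
# then check whether each name in pred is among them
terms <- strsplit(model, ", ", fixed = TRUE)
exact <- sapply(pred, function(p) {
  vapply(terms, function(x) p %in% x, logical(1))
})
```

Here loose[1, "am"] is TRUE while exact[1, "am"] is FALSE, and both results have the same shape, with one row per model and one column per name in pred.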
