acm_myk
acm_myk

Reputation: 23

How do I create a new single column based on multiple regex against multiple other columns?

I have a dataframe with multiple columns of strings. I want to

  1. look at all strings in a series of these columns
  2. see if any strings match an input pattern, ideally regex
  3. create a single new binary column that is equal to 1 or TRUE if there are any matches across all columns for a given row, and 0 or FALSE if there are no matches whatsoever.

My question has two parts.

First, I am using sapply and grepl and have managed to create a dataframe of TRUE/FALSE values for all of the columns in question, but am having trouble figuring out how to get to the "create a new single column" step. I tried rowwise mutate but that's throwing errors that I honestly don't understand!

df <- data.frame(
  idx = 1:5,
  column_b = letters[1:5],
  column_c = c('abc', 'abc', 'def', 'def', 'ghi'), 
  column_d = c('def', 'def', 'def', 'def', 'def'),
  column_e = c('ghi', 'ghi', 'ghi', 'abc', 'ghi')
)

apply_factor <- function(df, factor, col_low, col_high, pattern) {
   df %>%
   rowwise() %>%
   mutate(factor = sum(c_across(as.data.frame(sapply(select(df, {{col_low}}:{{col_high}}), grepl, pattern={{pattern}})))), na.rm = TRUE)
}

apply_factor(df, factor = 'abc', 'column_c', 'column_e', pattern = "^abc")

(double curly braces {{}} in the function due to something or other about dplyr I saw in another question...)

Console responds:

Error in `mutate()`:
! Problem while computing `factor = sum(...)`.
i The error occurred in row 1.
Caused by error in `as_indices_impl()`:
! Must subset columns with a valid subscript vector.
x Subscript has the wrong type `data.frame<
  column_c: logical
  column_d: logical
  column_e: logical
>`.
i It must be numeric or character.

If I run just the as.data.frame() part it gives me something that looks pretty good. I just need to make it into a new column!

> as.data.frame(sapply(select(df, 'column_c':'column_e'), grepl, pattern = "^abc"))

  column_c column_d column_e
1     TRUE    FALSE    FALSE
2     TRUE    FALSE    FALSE
3    FALSE    FALSE    FALSE
4    FALSE    FALSE     TRUE
5    FALSE    FALSE    FALSE

Second, I need to extend this function to a much larger dataset with many millions of rows and between 30 and 100 string columns. The large and variable number of columns, by the way, is why I need this function to be able to take any range of columns instead of listing out every possible column. Since processing time will be somewhat of a factor: is what I am doing here going to be massively inefficient? Is there an easier, faster way to accomplish this?

I've tried variations on sum, summarise, and any, but honestly it's very likely I did it wrong.

I also looked at this post, but I can't limit to using %in% as I need regex (I'm pretty sure I can't use regex with %in%...)

(I am relearning R after a long stint away in Stata-land, so please forgive any really obvious blunders or terminology weirdness!)

Upvotes: 2

Views: 101

Answers (1)

Darren Tsai
Darren Tsai

Reputation: 35594

grepl is vectorized, so you can use it with if_any(), which is much faster than rowwise.

df %>%
  mutate(abc = if_any(column_c:column_e, ~ grepl("^abc", .x)))

#   idx column_b column_c column_d column_e   abc
# 1   1        a      abc      def      ghi  TRUE
# 2   2        b      abc      def      ghi  TRUE
# 3   3        c      def      def      ghi FALSE
# 4   4        d      def      def      abc  TRUE
# 5   5        e      ghi      def      ghi FALSE

To wrap the code in a function:

ifany_grepl <- function(data, col, lower, upper, pattern) {
  data %>%
    mutate(!!sym(col) := if_any({{lower}}:{{upper}}, ~ grepl(pattern, .x)))
}

ifany_grepl(df, 'abc', 'column_c', 'column_e', "^abc")

Upvotes: 1

Related Questions