maycca

Reputation: 4090

R dplyr: Filter data by multiple Regex expressions defined by vector

I have a data frame from which I want to select the important columns and then keep only the rows whose values have a specific ending.

A regular expression makes it simple to define a single ending with the $ anchor (xx$). But how can I match any of several possible endings (xx$, yy$)?

Dummy example:

require(dplyr)

x <- c("aa", "aa", "aa", "bb", "cc", "cc", "cc")
y <- c(101, 102, 113, 201, 202, 344, 407)
type = rep("zz", 7)
df = data.frame(x, y, type)    

# Select all rows where y ends with "7"
df %>%
  select(x, y) %>%
  filter(grepl("7$", y))

# This seems to work when I hard-code the endings, but I need to supply them as a vector instead.
df %>%
  select(x, y) %>%
  filter(grepl("[27]$", y))  # hard-coded endings; inside [...] a | is a literal character, so it is dropped here


# How can I modify this expression to use a vector of endings (ids) instead?
ids = c(7, 2)     # vector of the endings I want to match

df %>%
  select(x, y) %>%
  filter(grepl("ids$", y))  # placeholder: how do I pass ids to grepl() here?

Expected output:

   x   y
1 aa 102
2 cc 202
3 cc 407

Example based on this question: Regular expressions (RegEx) and dplyr::filter()

Upvotes: 3

Views: 2053

Answers (1)

Wiktor Stribiżew

Reputation: 626845

You may use

df %>%
  select(x, y) %>%
  filter(grepl(paste0("(?:", paste(ids, collapse = "|"), ")$"), y))

The paste0("(?:", paste(ids, collapse="|"), ")$") part builds an alternation pattern that only matches at the end of the string, because of the $ anchor placed after the group.
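As a rough illustration, the pattern can be built step by step in base R, using the ids and y values from the question. The intermediate names alternation, pattern and matches are introduced here only for the sketch:

```r
# Sketch: build the alternation pattern one step at a time (base R only).
ids <- c(7, 2)
y <- c(101, 102, 113, 201, 202, 344, 407)

alternation <- paste(ids, collapse = "|")    # "7|2"
pattern <- paste0("(?:", alternation, ")$")  # "(?:7|2)$"

# grepl() coerces the numeric vector y to character before matching.
matches <- grepl(pattern, y)
y[matches]                                   # 102 202 407
```

The same pattern string can then be passed to grepl() inside filter().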

NOTE: If the values can contain special regex metacharacters, you need to escape the values in the character vector first:

regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
df %>%
  select(x, y) %>%
  filter(grepl(paste0("(?:", paste(regex.escape(ids), collapse = "|"), ")$"), y))
#                                  ^^^^^^^^^^^^^^^^^
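To see why the escaping matters, here is a small sketch with made-up endings that contain metacharacters. The vectors endings and values are hypothetical and not part of the question's data; regex.escape is repeated so the block runs on its own:

```r
# Same helper as above, repeated so this block is self-contained.
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

endings <- c("1.5", "a+b")                  # hypothetical endings with metacharacters
values <- c("x1.5", "x105", "a+b", "aab")   # hypothetical strings to test

raw_pattern <- paste0("(?:", paste(endings, collapse = "|"), ")$")
safe_pattern <- paste0("(?:", paste(regex.escape(endings), collapse = "|"), ")$")

# Unescaped: "." matches any character and "a+" means "one or more a".
grepl(raw_pattern, values)    # TRUE  TRUE FALSE  TRUE
# Escaped: only the literal endings "1.5" and "a+b" match.
grepl(safe_pattern, values)   # TRUE FALSE  TRUE FALSE
```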

For example, paste0("(?:", paste(c("7", "8", "ids"), collapse="|"), ")$") will output (?:7|8|ids)$:

  • (?: - start of a non-capturing group that acts as a container for the alternatives, so that the $ anchor applies to all of them and not just to the last one; the group matches any of:
    • 7 - a 7 char
    • | - or
    • 8 - an 8 char
    • | - or
    • ids - an ids substring
  • ) - end of the group
  • $ - end of the string.
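The effect of the group can be checked with a quick sketch. The test strings in values are hypothetical and chosen only to show the difference between the grouped and ungrouped patterns:

```r
# Hypothetical test strings, not taken from the question's data.
values <- c("71", "17", "78", "kids", "idsx")

# Without the group, $ binds only to the last alternative ("ids"),
# so a 7 or an 8 anywhere in the string is enough for a match.
ungrouped <- grepl("7|8|ids$", values)      # TRUE TRUE TRUE TRUE FALSE

# With the group, every alternative must sit at the end of the string.
grouped <- grepl("(?:7|8|ids)$", values)    # FALSE TRUE TRUE TRUE FALSE
```

Only "71" differs here: it contains a 7 but does not end with one, so the grouped pattern rejects it.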

Upvotes: 5
