Tim

Reputation: 75

IF filter for regex R

I'd like to create a flag in a data frame (or data table) based on a stringi match to some terms in R.

library(stringi)
df = data.frame(text = c("AABA","AACA","AAAA","BAAE","CAAD","CCCC","DDDD","EEEE"))
df$flag[stri_detect_regex(df$text,"AAB|AAC|AAA|BAA|CAA")] = 'Match1'

Showing output like this:

    text    flag
1   AABA    Match1
2   AACA    Match1
3   AAAA    Match1
4   BAAE    Match1
5   CAAD    Match1
6   CCCC    <NA>
7   DDDD    <NA>
8   EEEE    <NA>

I then want to check for another pattern:

df$flag[stri_detect_regex(df$text,"CCCC|DDDD")] = 'Match2'

But I only want to run this on rows where flag is NA, i.e. is.na(df$flag). It would also be great to know how I could include multiple conditions, e.g.

is.na(df$flag) & df$other_var == 1 

The reason is that I need to review many millions of rows and only want to run the regex on rows that don't already have a flag and/or meet other filter criteria. Thank you for your help!

Upvotes: 3

Views: 68

Answers (2)

rawr

Reputation: 20811

You could use named capture groups and extract the name of the group that matched:

df <- data.frame(text = c("AABA","AACA","AAAA","BAAE","CAAD","CCCC","DDDD","EEEE"))

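# Return the name of the capture group that matched, or NA when nothing matched
capture_name <- function(x) {
  # capture.start has one column per capture group, named after the group
  x <- attr(x, 'capture.start')
  ifelse(sum(x) == 0, NA, colnames(x)[x > 0])
}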

p <- c(Match1 = 'AAB|AAC|AAA|BAA|CAA',
       Match2 = 'CCCC|DDDD')
p <- paste(sprintf('(?<%s>%s)', names(p), p), collapse = '|')

within(df, {
  flag <- sapply(gregexpr(p, df$text, perl = TRUE), capture_name)
})

#   text   flag
# 1 AABA Match1
# 2 AACA Match1
# 3 AAAA Match1
# 4 BAAE Match1
# 5 CAAD Match1
# 6 CCCC Match2
# 7 DDDD Match2
# 8 EEEE   <NA>
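
To make the combined pattern more concrete, here is a small sketch of what the sprintf/paste step builds and where the group names end up. The variable names combined and m are only for illustration and are not part of the answer above.

```r
p <- c(Match1 = 'AAB|AAC|AAA|BAA|CAA',
       Match2 = 'CCCC|DDDD')

# Wrap each pattern in a named group and join the groups with '|'
combined <- paste(sprintf('(?<%s>%s)', names(p), p), collapse = '|')
combined
# [1] "(?<Match1>AAB|AAC|AAA|BAA|CAA)|(?<Match2>CCCC|DDDD)"

# With perl = TRUE the match object carries the group names as an attribute
m <- regexpr(combined, "CCCC", perl = TRUE)
attr(m, "capture.names")
# [1] "Match1" "Match2"
```

Each row is still tested against the whole combined pattern in a single pass, so this labels matches rather than skipping rows that are already flagged.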

Upvotes: 2

akrun

Reputation: 887173

We can use case_when from dplyr:

library(dplyr)
library(stringi)
df %>%
   mutate(flag = case_when(stri_detect_regex(text, "AAB|AAC|AAA|BAA|CAA") ~ "Match1",
                           stri_detect_regex(text, "CCCC|DDDD") ~ "Match2"))
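
Note that case_when evaluates every condition on all rows before picking the first TRUE one, so both regexes still run over the full column. If the goal is to run the second regex only on unflagged rows that also meet another condition, one possible sketch is to compute the row indices first and pass just that subset to stri_detect_regex. The other_var values below are made up for illustration (the question does not show them), and idx and hit are hypothetical helper names.

```r
library(stringi)

# other_var holds invented values; the question does not give them
df <- data.frame(
  text = c("AABA","AACA","AAAA","BAAE","CAAD","CCCC","DDDD","EEEE"),
  other_var = c(1, 1, 0, 1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Start from an explicit character NA column
df$flag <- NA_character_

# First pass: every row is tested
df$flag[stri_detect_regex(df$text, "AAB|AAC|AAA|BAA|CAA")] <- "Match1"

# Row numbers that are still unflagged and meet the extra condition
idx <- which(is.na(df$flag) & df$other_var == 1)

# Second pass: the regex only sees df$text[idx]
hit <- stri_detect_regex(df$text[idx], "CCCC|DDDD")
df$flag[idx[hit]] <- "Match2"

df$flag
# [1] "Match1" "Match1" "Match1" "Match1" "Match1" "Match2" NA       NA
```

Row 7 (DDDD) stays NA here only because its invented other_var is 0, so it never reaches the second regex.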

Upvotes: 3
