Reputation: 75
I'd like to create a flag in a data frame (or data table) based on a stringi match to some terms in R.
library(stringi)
df = data.frame(text = c("AABA","AACA","AAAA","BAAE","CAAD","CCCC","DDDD","EEEE"))
df$flag[stri_detect_regex(df$text,"AAB|AAC|AAA|BAA|CAA")] = 'Match1'
Showing output like this:
  text   flag
1 AABA Match1
2 AACA Match1
3 AAAA Match1
4 BAAE Match1
5 CAAD Match1
6 CCCC   <NA>
7 DDDD   <NA>
8 EEEE   <NA>
I then want to check for another pattern:
df$flag[stri_detect_regex(df$text,"CCCC|DDDD")] = 'Match2'
But I only want to run this on rows where flag is still NA, i.e. is.na(df$flag). It would also be great to know how I could include multiple conditions, e.g.
is.na(df$flag) & df$other_var == 1
The reason I want to do this is I need to review many millions of rows and only want to do the regex on rows that either don't have a flag already and/or include other filter criteria. Thank you for your help!
Upvotes: 3
Views: 68
Reputation: 20811
You could use named capture groups and extract the name of the group that matched:
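df <- data.frame(text = c("AABA","AACA","AAAA","BAAE","CAAD","CCCC","DDDD","EEEE"))

# Return the name of the first-listed capture group that matched, or NA if none did
capture_name <- function(x) {
  x <- attr(x, 'capture.start')
  hit <- colnames(x)[colSums(x > 0) > 0]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# One named pattern per flag, combined into a single regex of named groups
p <- c(Match1 = 'AAB|AAC|AAA|BAA|CAA',
       Match2 = 'CCCC|DDDD')
p <- paste(sprintf('(?<%s>%s)', names(p), p), collapse = '|')

within(df, {
  flag <- sapply(gregexpr(p, text, perl = TRUE), capture_name)
})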
#   text   flag
# 1 AABA Match1
# 2 AACA Match1
# 3 AAAA Match1
# 4 BAAE Match1
# 5 CAAD Match1
# 6 CCCC Match2
# 7 DDDD Match2
# 8 EEEE   <NA>
Upvotes: 2
Reputation: 887173
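We can use case_when from dplyr. Conditions are checked in order and the first one that is TRUE wins; rows that match none of them get NA: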
library(dplyr)
library(stringi)
df %>%
  mutate(flag = case_when(
    stri_detect_regex(text, "AAB|AAC|AAA|BAA|CAA") ~ "Match1",
    stri_detect_regex(text, "CCCC|DDDD") ~ "Match2"
  ))
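Note that case_when evaluates every condition on all rows. If the goal is to run the second regex only on rows that are still unflagged (and that pass some other filter), one option is to build an index of eligible rows first and match only on that subset. This is a minimal sketch, not a benchmark; the other_var column and its values are made up for illustration.

```r
library(stringi)

df <- data.frame(
  text = c("AABA","AACA","AAAA","BAAE","CAAD","CCCC","DDDD","EEEE"),
  other_var = c(1, 1, 1, 1, 1, 1, 0, 1)  # hypothetical filter column
)

# First pass: start with an all-NA flag and mark the Match1 rows
df$flag <- NA_character_
df$flag[stri_detect_regex(df$text, "AAB|AAC|AAA|BAA|CAA")] <- "Match1"

# Second pass: only rows that are still unflagged AND meet the extra condition
todo <- which(is.na(df$flag) & df$other_var == 1)

# The regex runs on the subset only, not on the full column
hit <- stri_detect_regex(df$text[todo], "CCCC|DDDD")
df$flag[todo[hit]] <- "Match2"

df
```

Here row 7 ("DDDD") stays NA because its other_var is 0, so it is never passed to the second regex. Dropping the `& df$other_var == 1` part gives the plain "only where flag is NA" version.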
Upvotes: 3