twgardner2

Reputation: 660

R: %in% Operator with Wildcard/REGEX

I have a data frame with a column of concatenated strings, the last 11 digits of each being a census tract. I have a separate vector of strings in which the last 2 or 5 digits represent a state or a county, respectively. I've appended a * to the end of each 2- or 5-digit ID. I need to go through the data frame and flag whether the trans variable (census tract) matches anything in the patterns vector (state or county), letting the * stand for the remaining 9 or 6 digits in trans.

As shown in the code below, I've gotten this to work by collapsing all the patterns into a single string with collapse="|" and passing that string to grepl(). However, I'm wondering whether I can accomplish this with a vectorized operation because 1) it feels like I should be able to, and 2) in practice, the list of patterns is enormous and it feels foolish to put them all into a single character string.

Is there anything similar to the %in% operator, but with regex/wildcard character support?

library(dplyr)

trans <- c("1-IA-45045000100",
           "2-IA-23003001801",
           "3-LITP-01001000100",
           "4-OTP-06006000606",
           "4-OTP-06010001001",
           "1-IA-45001010002",
           "2-IA-45045000101",
           "2-LITP-23005005002")
df <- data.frame(id = 1:8, trans)

patterns <- c("1-IA-45*",
              "2-LITP-23005*",
              "4-OTP-06*")

# This works, but I'm looking for a better way
patterns_string <- paste(patterns, collapse="|")
df <- df %>% mutate(match = grepl(patterns_string, trans))

# Is there any way to keep the patterns in a vector and check whether
# any of them match each row of my data frame with grepl(), or to use
# %in% with a wildcard character?

# Warning: "argument 'pattern' has length > 1 and only the first element will be used"
df <- df %>% mutate(match = grepl(patterns, trans))

# Can't take advantage of the wildcard character '*'
df <- df %>% mutate(match = trans %in% patterns)

Upvotes: 0

Views: 1718

Answers (2)

Rich Scriven

Reputation: 99331

You could run each pattern through grepl() via lapply(), then use Reduce() with the logical "or" operator | to combine the results.

df$match <- Reduce("|", lapply(patterns, grepl, df$trans))
df
#   id              trans match
# 1  1   1-IA-45045000100  TRUE
# 2  2   2-IA-23003001801 FALSE
# 3  3 3-LITP-01001000100 FALSE
# 4  4  4-OTP-06006000606  TRUE
# 5  5  4-OTP-06010001001  TRUE
# 6  6   1-IA-45001010002  TRUE
# 7  7   2-IA-45045000101 FALSE
# 8  8 2-LITP-23005005002  TRUE
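One caveat worth keeping in mind: grepl() treats each pattern as a regular expression, where * means "zero or more of the preceding character" rather than "anything". So "1-IA-45*" also matches a string containing "1-IA-49". If the patterns are really literal prefixes, as the question describes, a rough sketch of a stricter version is to strip the trailing * and use base R's startsWith(). The names prefixes and is_match below are just illustrative, not from the original post.

```r
trans <- c("1-IA-45045000100", "2-IA-23003001801", "3-LITP-01001000100",
           "4-OTP-06006000606", "4-OTP-06010001001", "1-IA-45001010002",
           "2-IA-45045000101", "2-LITP-23005005002")
patterns <- c("1-IA-45*", "2-LITP-23005*", "4-OTP-06*")

# As a regex, "5*" means zero or more 5s, so this unintended match succeeds
regex_false_positive <- grepl("1-IA-45*", "1-IA-49000000000")

# Assumption: every pattern is a literal prefix followed by a single "*".
# Drop the trailing "*" so each pattern becomes a plain prefix.
prefixes <- sub("\\*$", "", patterns)

# startsWith() is vectorized over its first argument; test each prefix
# against all of trans and combine the logical vectors with "|"
is_match <- Reduce(`|`, lapply(prefixes, function(p) startsWith(trans, p)))

# The same unintended string is rejected by the literal prefix test
prefix_false_positive <- startsWith("1-IA-49000000000", prefixes[1])
```

On the sample data this flags the same rows as the output above, but it no longer depends on how the regex engine interprets the *.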

Upvotes: 4

akrun

Reputation: 887108

Here is an option using the tidyverse with stri_detect_regex() from stringi:

library(stringi)
library(tidyverse)
patterns %>%
      map(~stri_detect_regex(df$trans, .)) %>% 
      reduce(`|`) %>%
      mutate(df, match = .)
#  id              trans match
#1  1   1-IA-45045000100  TRUE
#2  2   2-IA-23003001801 FALSE
#3  3 3-LITP-01001000100 FALSE
#4  4  4-OTP-06006000606  TRUE
#5  5  4-OTP-06010001001  TRUE
#6  6   1-IA-45001010002  TRUE
#7  7   2-IA-45045000101 FALSE
#8  8 2-LITP-23005005002  TRUE
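Since the question says the tract is always the last 11 digits and the patterns always end in a 2-digit state or 5-digit county code, it may also be possible to skip regex entirely and use plain %in%, which should scale better to a very large pattern list. This is only a sketch under that fixed-width assumption; state_key, county_key, prefixes and is_match are made-up names for illustration.

```r
trans <- c("1-IA-45045000100", "2-IA-23003001801", "3-LITP-01001000100",
           "4-OTP-06006000606", "4-OTP-06010001001", "1-IA-45001010002",
           "2-IA-45045000101", "2-LITP-23005005002")
patterns <- c("1-IA-45*", "2-LITP-23005*", "4-OTP-06*")

# Assumption: each pattern is a literal prefix plus one trailing "*"
prefixes <- sub("\\*$", "", patterns)

# Assumption: every trans value ends in an 11-digit tract, so dropping the
# last 9 characters leaves the state-level key and dropping the last 6
# leaves the county-level key
n <- nchar(trans)
state_key  <- substr(trans, 1, n - 9)
county_key <- substr(trans, 1, n - 6)

# Exact lookups against the prefix vector, no wildcard needed
is_match <- state_key %in% prefixes | county_key %in% prefixes
```

To attach the result to the data frame, something like df$match <- is_match would do. If any trans value does not end in an 11-digit tract, the keys will be cut in the wrong place, so this is worth validating first.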

Upvotes: 1
