jbaums

Reputation: 27388

Replace matches according to the pattern that was matched

Given a set of regular expressions, is there a simple way to match multiple patterns, and replace the matched text according to the pattern that was matched?

For example, for the following data x, each element begins with either a number or a letter, and ends with either a number or a letter. Let's call these patterns num_num (for begins with number, ends with number), num_let (begins with number, ends with letter), let_num, and let_let.

x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let='^\\d.*[[:alpha:]]$',
  num_num='^\\d(.*\\d)?$',
  let_num='^[[:alpha:]].*\\d$',
  let_let='^[[:alpha:]](.*[[:alpha:]])?$'
)

To replace each string with the name of the pattern it follows, we could do:

m <- lapply(type, grep, x)
rep(names(type), sapply(m, length))[order(unlist(m))]
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"

Is there a more efficient approach?


gsubfn?

I know that with gsubfn we can simultaneously replace different matches, e.g.:

library(gsubfn)
gsubfn('.*', list('1p33'='foo', '123abc'='bar'), x)
## [1] "bar"     "78fdsaq" "aq12111" "foo"     "123"     "pzv"

but I'm not sure whether the replacements can be made dependent on the pattern that was matched rather than on the match itself.


stringr?

str_replace_all doesn't play nicely with this example: the patterns are applied one after another, each to the result of the previous replacement, and every replacement name itself begins and ends with a letter. Everything therefore ends up overwritten with let_let:

library(stringr)
str_replace_all(x, setNames(names(type), unlist(type)))
## [1] "let_let" "let_let" "let_let" "let_let" "let_let" "let_let"
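
To make the overwriting visible, here is a minimal base R sketch that mimics sequential replacement with sub() and Reduce(), keeping every intermediate result. It is only an illustration of the effect, not how stringr is implemented; the let_let pattern is written with an optional group so that a single letter would also match, and steps is a name introduced just for this example.

```r
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])?$'
)

# Apply the patterns one after another, each to the output of the previous
# step, and keep every intermediate vector (element 1 is the untouched input).
steps <- Reduce(
  function(s, nm) sub(type[[nm]], nm, s),
  names(type),
  init = x,
  accumulate = TRUE
)

steps[[2]]  # after num_let only: the first two elements are now "num_let"
steps[[5]]  # after let_let: every element, including earlier labels, is "let_let"
```

The last pattern matches the labels written by the earlier steps, which is why the order of the patterns matters.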

Reordering type so that the let_let pattern is applied first solves the problem, but having to rely on pattern order makes me nervous.

type2 <- rev(type)
str_replace_all(x, setNames(names(type2), unlist(type2)))
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"

Upvotes: 10

Views: 407

Answers (2)

NGaffney

Reputation: 1532

stringr

We can use str_replace_all if we wrap the replacements in a marker (here, double underscores) so that none of the regular expressions can match them, and then add a final replacement that strips the marker again. For example:

library(stringr)
type2 <- setNames(c(str_replace(names(type), "(.*)", "__\\1__"), "\\1"), 
                  c(unlist(type), "^__(.*)__$"))
str_replace_all(x, type2)
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"

grepl and tidyr

Another approach is to match first and then replace. One way to do this is with grepl and tidyr:

library(plyr)
library(dplyr)
library(tidyr)

out <- data.frame(t(1*aaply(type, 1, grepl, x)))

out[out == 0] <- NA
out <- out %>% 
  mutate(id = 1:nrow(.)) %>%
  gather(name, value, -id, na.rm = TRUE) %>%
  arrange(id) %>%  # gather() stacks rows by pattern; restore the order of x
  select(name)
as.character(out[,1])
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"

Although this approach doesn't look as efficient, it makes it easy to find elements that match no pattern or more than one.
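
The same match-first idea can be sketched in base R without plyr or tidyr. This is only an illustration under a few assumptions: x2 adds a hypothetical element ('-12') that matches none of the patterns, and hits, n_hits, one and label are names introduced here.

```r
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])?$'
)
# The question's data plus one hypothetical element that fits no pattern
x2 <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv', '-12')

# Logical matrix: one row per element of x2, one column per pattern
hits <- sapply(type, grepl, x2)
n_hits <- rowSums(hits)

x2[n_hits == 0]  # elements that match no pattern
x2[n_hits > 1]   # elements that match more than one pattern

# Label only the elements that match exactly one pattern; leave the rest NA
label <- rep(NA_character_, length(x2))
one <- n_hits == 1
label[one] <- apply(hits[one, , drop = FALSE], 1, function(r) names(type)[r])
label
```

Because the matrix is built before anything is replaced, the order of the patterns has no effect on the result.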


From what I understand, substitution matching is implemented in PCRE2, and I believe it would allow this type of problem to be solved directly in the regex. Unfortunately, it seems that no one has built a PCRE2 package for R yet.

Upvotes: 1

WaltS

Reputation: 5520

Perhaps one of these.

# base R method
mm2 <- character(length(x))
for (n in seq_along(type)) mm2 <- replace(mm2, grep(type[[n]], x), names(type)[n])

# purrr 0.2.0 method
library(purrr)
mm3 <- map(grep, .x=type, x = x) %>% (function(z) replace(x, flatten_int(z), rep(names(type), lengths(z))))

The base R method is somewhat faster than the posted code for both small and larger data sets. The purrr method is slower than the posted code for small data sets but about the same as the base R method for larger data sets.
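
To check this kind of comparison on your own machine, a rough sketch using only base R's system.time() might look like the following. The replication factor, big_x, and the wrapper functions posted() and loop() are assumptions made for this example, and no timings are shown because they depend on the machine and R version.

```r
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])?$'
)
big_x <- rep(x, 1e4)  # hypothetical larger data set

# The code posted in the question, wrapped in a function
posted <- function(x) {
  m <- lapply(type, grep, x)
  rep(names(type), lengths(m))[order(unlist(m))]
}

# The base R loop, wrapped in a function
loop <- function(x) {
  out <- character(length(x))
  for (n in seq_along(type)) out <- replace(out, grep(type[[n]], x), names(type)[n])
  out
}

# Both must agree before their timings are worth comparing
stopifnot(identical(posted(big_x), loop(big_x)))

system.time(posted(big_x))
system.time(loop(big_x))
```

For more reliable numbers, repeat each call several times or use a dedicated benchmarking package.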

Upvotes: 2
