biomiha

Reputation: 1422

str_replace_all replacing named vector elements iteratively not all at once

Let's say I have a long character string: pneumonoultramicroscopicsilicovolcanoconiosis. I'd like to use stringr::str_replace_all to replace certain letters with others. According to the documentation, str_replace_all can take a named vector and replaces each name with its value. That works fine for one replacement, but with several it seems to apply them iteratively, so later replacements act on the output of earlier ones rather than on the original string. I'm not sure this is the intended behaviour.

library(tidyverse)
text_string = "developer"
text_string %>%
  str_replace_all(c(e = "X")) # this works fine
#> [1] "dXvXlopXr"
text_string %>%
  str_replace_all(c(e = "p", p = "e")) # not the intended behaviour
#> [1] "develoeer"

Desired result:

[1] "dpvploepr"

Which I get by introducing a new character:

text_string %>% 
  str_replace_all(c(e ="X", p = "e", X = "p"))

It's a usable workaround but hardly generalisable. Is this a bug or are my expectations wrong?

I'd like to also be able to replace n letters with n other letters simultaneously, preferably using either two vectors (like "old" and "new") or a named vector as input.

reprex edited for easier human reading

Upvotes: 8

Views: 3625

Answers (4)

Mark

Reputation: 4537

2023 Update

Back when I first answered this I had a thrown-together R package that lived only on my GitHub. Since then, I've refined it substantially and it's now on CRAN and even used in other packages.

The readme and CRAN documentation spell all this out, but I understand how helpful code is on this page. The updated usage is based on passing in vectors of patterns and replacements. There's a recycle option that lets you supply a replacement vector that's shorter than the pattern vector and just keep cycling through it. You can also pass arguments through to regexpr in the backend (e.g. fixed = TRUE).

install.packages('mgsub')
library(mgsub)
# The CRAN version names this argument `replacement` (singular)
mgsub("developer", 
      pattern = c("e", "p"), 
      replacement = c("p", "e"))
#> [1] "dpvploepr"

Original Answer

I'm working on a package to deal with this type of problem. It is safer than the qdap::mgsub function because it does not rely on placeholders. It fully supports regex in both the pattern and the replacement. You provide a named list where the names are the strings to match and the values are their replacements.

devtools::install_github("bmewing/mgsub")
library(mgsub)
mgsub("developer",list("e" ="p", "p" = "e"))
#> [1] "dpvploepr"

qdap::mgsub(c("e","p"),c("p","e"),"developer")
#> [1] "dpvploppr"

Upvotes: 9

TJ Mahr

Reputation: 3954

The iterative behavior is intended. That said, we can write our own workaround. I am going to use character subsetting for the replacement.

In a named vector, we can look up things by name and get a replacement value for each name. This is like doing all the replacement simultaneously.

rules <- c(a = "X", b = "Y", X = "a")
chars <- c("a", "a", "b", "X", "X")
rules[chars]
#>   a   a   b   X   X 
#> "X" "X" "Y" "a" "a"

So here, looking up "a" in the rules vector gets us "X", effectively replacing "a" with "X". The same goes for the other characters.

One problem is that names without a match yield NA.

rules <- c(a = "X", b = "Y", X = "a")
chars <- c("a", "Y", "Z")
rules[chars]
#>    a <NA> <NA> 
#>  "X"   NA   NA

To prevent the NAs from appearing, we can expand the rules to include any new characters so that a character is replaced by itself.

rules <- c(a = "X", b = "Y", X = "a")
chars <- c("a", "Y", "Z")
no_rule <- chars[! chars %in% names(rules)]
rules2 <- c(rules, setNames(no_rule, no_rule))
rules2[chars]
#>   a   Y   Z 
#> "X" "Y" "Z"

And that's the logic behind the following function.

  • Break strings to characters
  • Create a full list of replacement rules
  • Look up replacement values
  • Glue strings back together

library(stringr)

str_replace_chars <- function(string, rules) {
  # Expand rules to replace characters with themselves 
  # if those characters do not have a replacement rule
  chars <- unique(unlist(strsplit(string, "")))
  complete_rules <- setNames(chars, chars)
  complete_rules[names(rules)] <- rules

  # Split each string into characters, replace and unsplit
  for (string_i in seq_along(string)) {
    chars_i <- unlist(strsplit(string[string_i], ""))
    string[string_i] <- paste0(complete_rules[chars_i], collapse = "")
  }
  string
}

rules <- c(a = "X", p = "e", e = "p")
string <- c("application", "developer")
str_replace_chars(string, rules)
#> [1] "XeelicXtion" "dpvploepr"
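
For the single-character case, base R's chartr() already translates all characters simultaneously, so it may be enough on its own. A small sketch, assuming every name and every value in rules is exactly one character (rules_to_chartr is a hypothetical helper name, not part of any package):

```r
# Hypothetical helper: apply single-character rules with base R's chartr().
# Assumes every name and every value in `rules` is exactly one character.
rules_to_chartr <- function(string, rules) {
  stopifnot(all(nchar(names(rules)) == 1), all(nchar(rules) == 1))
  old <- paste(names(rules), collapse = "")  # characters to look for
  new <- paste(rules, collapse = "")         # their replacements, same order
  chartr(old, new, string)
}

rules <- c(a = "X", p = "e", e = "p")
rules_to_chartr(c("application", "developer"), rules)
#> [1] "XeelicXtion" "dpvploepr"

# The same swap written directly with two strings
chartr("ep", "pe", "developer")
#> [1] "dpvploepr"
```

This does not cover multi-character patterns, which is where the lookup function above or a package is still needed.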

Upvotes: 1

Benjamin Schwetz

Reputation: 643

My workaround would be to take advantage of the fact that str_replace_all can take functions as an input for the replacement.

library(stringr)
text_string = "developer"
pattern <- "p|e"
# Look each match up in a named vector. This works whether stringr calls the
# function once per match or once with a vector of all matches.
fun <- function(query) {
    unname(c(e = "p", p = "e")[query])
}

str_replace_all(text_string, pattern, fun)
#> [1] "dpvploepr"

Of course, if you need to scale up, I would suggest using a more sophisticated function.
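
One way to scale up without hand-writing a branch per pattern is to build the pattern from the names of a rules vector and look each match up in that vector. Here is a minimal base R sketch of that idea; swap_all is a hypothetical helper name, and it assumes the names in rules are literal strings with no regex metacharacters:

```r
# Hypothetical helper: replace every match of names(rules) in a single pass.
# Assumes the names are literal strings with no regex metacharacters.
swap_all <- function(x, rules) {
  pattern <- paste(names(rules), collapse = "|")
  hits <- gregexpr(pattern, x)
  # Each match is looked up in `rules` once, so a replacement is never
  # matched again by a later rule.
  regmatches(x, hits) <- lapply(
    regmatches(x, hits),
    function(found) unname(rules[found])
  )
  x
}

swap_all("developer", c(e = "p", p = "e"))
#> [1] "dpvploepr"

swap_all("cat dog", c(cat = "dog", dog = "cat"))
#> [1] "dog cat"
```

Because all matches are located in the original string before anything is replaced, the order of the rules does not matter here.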

Upvotes: 2

MrSmithGoesToWashington

Reputation: 1076

There is probably an order to what the function does: after replacing every c with s, it replaces every s with c, so only c remains. Try this:

long_string %>% str_replace_all(c(c ="X", s = "U"))  %>% str_replace_all(c(X ="s", U = "c"))
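
To see the ordering effect step by step, here is a sketch on the question's "developer" example. It uses base gsub() as a stand-in and assumes str_replace_all applies each element of the named vector in turn, as the question's output suggests:

```r
text_string <- "developer"

# Step 1: e -> p. The original p is now indistinguishable from the new ones.
step1 <- gsub("e", "p", text_string, fixed = TRUE)
step1
#> [1] "dpvploppr"

# Step 2: p -> e turns every p, old and new, into e.
step2 <- gsub("p", "e", step1, fixed = TRUE)
step2
#> [1] "develoeer"

# Two passes through placeholders avoid the clash, provided the placeholders
# (X and U here) do not occur in the input.
tmp <- gsub("p", "U", gsub("e", "X", text_string, fixed = TRUE), fixed = TRUE)
result <- gsub("U", "e", gsub("X", "p", tmp, fixed = TRUE), fixed = TRUE)
result
#> [1] "dpvploepr"
```

The placeholder approach breaks as soon as the input already contains X or U, which is why it does not generalise well.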

Upvotes: 1
