String replacements: how to deal with similar strings and spaces

Question

Context: translate a table from French to English using a table containing corresponding replacements.

Problem: some character strings are very similar, and when whitespace is involved, str_replace_all() does not seem to consider the whole string.

Reproducible example:

library(stringr)  #needed for the str_replace_all() function

#datasets

# test is the table indicating corresponding strings
test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
                  en = as.character(c("Other", "Others", "Other again")),
                  stringsAsFactors = FALSE)
# test1 is the table I want to translate
test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
                   stringsAsFactors = FALSE)

# translate using a named vector of replacements (names = fr, values = en)
test2 = str_replace_all(test1$totrans, setNames(test$en, test$fr))

Output:

I get

> test2
[1] "Other"        "Others"       "Other encore"

Expected result:

> testexpected
[1] "Other"       "Others"      "Other again"

As you can see, if strings start the same way but contain no whitespace, the replacement succeeds (see "Other" and "Others"), but when there is whitespace, it fails ("Autre encore" is replaced by "Other encore" and not by "Other again").

I feel the answer is very obvious but I just can't find out how to solve it... Any suggestion is welcome.

Allan Cameron · Accepted Answer

I think you just need word boundaries (i.e. "\\b") around your lookups. It is straightforward to add these with a paste0() call inside str_replace_all().
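For context, as far as I understand it, a named vector passed to str_replace_all() is applied one pattern at a time, in order, so "Autre" is replaced before "Autre encore" is ever tried. The sketch below replays those steps by hand with the question's strings; the names x, step1, step2 and step3 are only illustrative.

```r
library(stringr)

x <- c("Autre", "Autres", "Autre encore")

# Step 1: the first pattern, "Autre", matches inside all three strings
step1 <- str_replace_all(x, "Autre", "Other")
step1
# "Other" "Others" "Other encore"

# Steps 2 and 3: "Autres" and "Autre encore" no longer occur anywhere,
# so these replacements leave the strings unchanged
step2 <- str_replace_all(step1, "Autres", "Others")
step3 <- str_replace_all(step2, "Autre encore", "Other again")
identical(step3, step1)
# TRUE
```

This also suggests that "Autres" becoming "Others" in the question is a coincidence: it is "Autre" turned into "Other" with the trailing "s" left in place.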

Note you don't need to include the whole tidyverse for this; the str_replace_all function is part of the stringr package, which is just one of several packages loaded when you call library(tidyverse):

library(stringr) 

test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
                  en = as.character(c("Other", "Others", "Other again")),
                  stringsAsFactors = FALSE)

test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
                   stringsAsFactors = FALSE)

str_replace_all(test1$totrans, paste0("\\b", test$fr, "\\b"), test$en)
#> [1] "Other"       "Others"      "Other again"

Created on 2020-05-14 by the reprex package (v0.3.0)
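Note that the call above pairs each pattern with the string in the same position, so it relies on test1$totrans and test$fr being in the same order. If the table to translate has a different order or length, and each cell should be translated as a whole (an assumption on my part), a lookup with base R's match() may be simpler. A minimal sketch, where totrans, idx and translated are illustrative names and "Inconnu" is a made-up value with no translation:

```r
test <- data.frame(fr = c("Autre", "Autres", "Autre encore"),
                   en = c("Other", "Others", "Other again"),
                   stringsAsFactors = FALSE)

# Illustrative input: different order, a repeated value and an unknown value
totrans <- c("Autre encore", "Autre", "Inconnu", "Autres", "Autre")

# match() gives the row of test$fr equal to each element (NA if there is none)
idx <- match(totrans, test$fr)

# Take the English string where a match exists, otherwise keep the original
translated <- ifelse(is.na(idx), totrans, test$en[idx])
translated
# "Other again" "Other" "Inconnu" "Others" "Other"
```

Because match() compares whole strings, "Autre" can never match part of "Autre encore", and no regular expression escaping is needed.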
