Reputation: 2977
Context: I want to translate a table from French to English using a lookup table of corresponding replacements.
Problem: some of the strings are very similar, and when white space is involved str_replace_all() does not seem to consider the whole string.
Reproducible example:
library(stringr) #needed for the str_replace_all() function
#datasets
# test is the table indicating corresponding strings
test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
en = as.character(c("Other", "Others", "Other again")),
stringsAsFactors = FALSE)
# test1 is the table I want to translate
test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
stringsAsFactors = FALSE)
# here is a function to translate
test2 = str_replace_all(test1$totrans, setNames(test$en, test$fr))
Output:
I get
> test2
[1] "Other" "Others" "Other encore"
Expected result:
> testexpected
[1] "Other" "Others" "Other again"
As you can see, if the strings start the same way but contain no whitespace, the replacement succeeds (see Other and Others), but when there is whitespace it fails ("Autre encore" is replaced by "Other encore" and not by "Other again").
I feel the answer is very obvious but I just can't find out how to solve it... Any suggestion is welcome.
Upvotes: 1
Views: 68
Reputation: 173858
I think you just need word boundaries (i.e. "\\b") around your lookups. It is straightforward to add these with a paste0 call inside str_replace_all.
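To illustrate what the word boundaries change, here is a minimal sketch using str_detect from stringr on the strings from the question. The variable names (plain_hit, bounded_hit, space_hit) are only for illustration. Note that a space also counts as a word boundary, so "\\bAutre\\b" on its own still matches inside "Autre encore".

```r
library(stringr)

# Without boundaries, "Autre" is also found inside the longer word "Autres".
plain_hit <- str_detect("Autres", "Autre")             # TRUE

# With boundaries, "Autre" must be a whole word, so "Autres" no longer matches.
bounded_hit <- str_detect("Autres", "\\bAutre\\b")     # FALSE

# The space after "Autre" is a word boundary, so this still matches.
space_hit <- str_detect("Autre encore", "\\bAutre\\b") # TRUE
```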
Note you don't need to include the whole tidyverse for this; the str_replace_all function is part of the stringr package, which is just one of several packages loaded when you call library(tidyverse):
library(stringr)
test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
en = as.character(c("Other", "Others", "Other again")),
stringsAsFactors = FALSE)
test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
stringsAsFactors = FALSE)
str_replace_all(test1$totrans, paste0("\\b", test$fr, "\\b"), test$en)
#> [1] "Other" "Others" "Other again"
Created on 2020-05-14 by the reprex package (v0.3.0)
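One caveat, as far as I can tell: when pattern and replacement are passed as two plain vectors, str_replace_all pairs each string with the pattern at the same position, so the call above relies on test1$totrans and test$fr being in the same order. If every cell is expected to hold exactly one entry of the lookup table, an exact lookup with base R's match() may be more robust. The sketch below assumes that unknown values should be left unchanged; translate_exact and shuffled are hypothetical names, not part of any package.

```r
# Lookup table, as in the question
test <- data.frame(fr = c("Autre", "Autres", "Autre encore"),
                   en = c("Other", "Others", "Other again"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: translate whole cells by exact match.
# Values not found in `from` are returned unchanged (an assumption).
translate_exact <- function(x, from, to) {
  idx <- match(x, from)            # position in the lookup, NA if absent
  ifelse(is.na(idx), x, to[idx])   # keep the original where there is no match
}

# Input in a different order than the lookup table, plus an unknown value
shuffled <- c("Autre encore", "Inconnu", "Autres", "Autre")
translate_exact(shuffled, test$fr, test$en)
#> [1] "Other again" "Inconnu"     "Others"      "Other"
```

Because no regular expression is involved, special characters in the French strings need no escaping, but partial matches inside longer text are not translated.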
Upvotes: 2