sprinklesx

Reputation: 83

gsub() not recognizing and replacing certain accented characters

I have a df with various names, many of which contain accented/non-English characters. I used gsub's for each of the characters I wanted to replace, and that worked for many of them; however, for several of the characters, it did not replace them at all.

An example of the non-working gsub: gsub("č","c",df,fixed=TRUE)

Here are the characters that were not replaced: ł ř ń š ž Ľ ţ ę č ć

I want to replace them with their English "look-alike" equivalents: l r n s z L t e c c

In addition to the gsub attempts, I have also tried using chartr("łřńšžĽţęčć","lrnszLtecc",df$Name). Like the gsub attempts, this ended in failure as well.

df<-data.frame(Name=c("Stipe Miočić","Duško Todorović","Michał Oleksiejczuk","Jiři Prochazka","Bartosz Fabiński","Damir Hadžović","Ľudovit Klein","Diana Belbiţă","Joanna Jędrzejczyk" ))

Above is a df with several of the names that are giving me trouble. The problem is that when you run this and view the resulting df, the characters giving me problems are already gone and English versions are shown in their place. However, this does not happen in the main df I'm working on, which contains directly scraped data.

Any insight into this problem and how to resolve it would be greatly appreciated.

Upvotes: 2

Views: 407

Answers (2)

Ryszard Czech

Reputation: 18611

Use stringi::stri_trans_general:

library(stringi)
df<-data.frame(Name=c("Stipe Miočić","Duško Todorović","Michał Oleksiejczuk","Jiři Prochazka","Bartosz Fabiński","Damir Hadžović","Ľudovit Klein","Diana Belbiţă","Joanna Jędrzejczyk" ))
stri_trans_general(df$Name, "Latin-ASCII")

Results:

[1] "Stipe Miocic"        "Dusko Todorovic"     "Michal Oleksiejczuk"
[4] "Jiri Prochazka"      "Bartosz Fabinski"    "Damir Hadzovic"     
[7] "Ludovit Klein"       "Diana Belbita"       "Joanna Jedrzejczyk" 

See R proof.
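To use this on the data frame itself, the transliterated vector can be written back to a column. The following is only a minimal sketch: the column name Name_ascii is made up for illustration, and the names are written with \u escapes (an assumption on my part) so the example does not depend on how the script file or console encodes accented literals.

```r
library(stringi)

# Names written with \u escapes so the source file's encoding does not matter
df <- data.frame(Name = c("Stipe Mio\u010di\u0107",   # Miočić
                          "Diana Belbi\u0163\u0103",  # Belbiţă
                          "\u013dudovit Klein"))      # Ľudovit

# Transliterate Latin letters with diacritics to plain ASCII;
# Name_ascii is a hypothetical column name used only in this sketch
df$Name_ascii <- stri_trans_general(df$Name, "Latin-ASCII")

# TRUE only if every converted name is pure ASCII
all_ascii <- all(stri_enc_isascii(df$Name_ascii))
```

Checking all_ascii afterwards is a quick way to spot any characters the transliterator left untouched in scraped data.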

Upvotes: 2

Ian Campbell

Reputation: 24790

You can use stringi::stri_replace_all_fixed:

library(stringi)
stri_replace_all_fixed(df$Name,
                       c("ł","ř","ń","š","ž","Ľ","ţ","ę","č","ć"),
                       c("l","r","n","s","z","L","t","e","c","c"),
                       vectorize_all = FALSE)
[1] "Stipe Miocic"        "Dusko Todorovic"     "Michal Oleksiejczuk" "Jiri Prochazka"      "Bartosz Fabinski"   
[6] "Damir Hadzovic"      "Ludovit Klein"       "Diana Belbită"       "Joanna Jedrzejczyk" 
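Note that "Belbită" still has its ă in the output above, because ă is not in the pattern vector; only listed characters are replaced. With vectorize_all = FALSE, every pattern is applied in turn to every element of the input, rather than patterns being paired with input elements one by one. A small sketch of extending the mapping, where the from and to vectors are illustrative names and the characters are written as \u escapes to sidestep source-file encoding issues:

```r
library(stringi)

x <- c("Mio\u010di\u0107", "Belbi\u0163\u0103")  # "Miočić", "Belbiţă"

# Illustrative mapping; "\u0103" (ă) is added to the original list
from <- c("\u010d", "\u0107", "\u0163", "\u0103")  # č ć ţ ă
to   <- c("c",      "c",      "t",      "a")

# vectorize_all = FALSE applies each from/to pair to every string in x
cleaned <- stri_replace_all_fixed(x, from, to, vectorize_all = FALSE)
cleaned
```

The trade-off against stri_trans_general is that the mapping has to list every character explicitly, which gives full control but silently skips anything not listed.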

Upvotes: 1
