R - How to simplify this text clean-up of special characters?

Question

I suspect there is a way to simplify this text preprocessing, but I could not find out how to merge all these character replacements into a single line. I would like to avoid the repetition in my current solution (see below):

Encoding(posts2$caption_clean) <- "UTF-8"
posts2$caption_clean <- iconv(posts2$caption_clean, "latin1", "UTF-8")
posts2$caption_clean <- gsub("Ã\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("â\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ð\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Â\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("å\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ð\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ñ\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ù\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ø\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ú\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ì\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Õ\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ã\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Û\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ë\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ê\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("è¿½\\S*","",posts2$caption_clean)

Does anyone know how I can simplify this?

Thanks!

Radim · Accepted Answer

# put every single-character target into one character class []
# so that any of them matches, then let \\S* consume the
# non-whitespace characters that follow it

regex <- "[Ãâð]\\S*"
string <- "Ã  x â x ð y "
gsub(regex, "", string)

result:

[1] "  x  x  y "
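
For the full list in the question, one option is to join the targets with `|` (alternation), which also covers the multi-character target `è¿½`. This is a minimal sketch, assuming the session runs in a UTF-8 locale; the names `targets`, `pattern`, and `clean_caption` are illustrative and not from the original post.

```r
# Lead characters taken from the question; each marks the start of a garbled run.
targets <- c("Ã", "â", "ð", "Â", "å", "Ð", "Ñ", "Ù", "Ø", "Ú",
             "ì", "Õ", "ã", "Û", "ë", "ê", "è¿½")

# Join the targets with | and append \\S* to consume the rest of the token.
pattern <- paste0("(", paste(targets, collapse = "|"), ")\\S*")

# Hypothetical helper: a single gsub() call instead of seventeen.
clean_caption <- function(x) gsub(pattern, "", x)

cleaned <- clean_caption(c("nice Ã©clair pic", "sunset ðabc today", "plain text"))
print(cleaned)
```

With that in place, the column from the question could be cleaned with `posts2$caption_clean <- clean_caption(posts2$caption_clean)`. Note that each match starts at the target character, so any text glued to its right is dropped as well, and the surrounding spaces are left in place.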
