Reputation: 27
I suspect there is a way to simplify this text preprocessing, but I could not find a way to merge all these character replacements into a single line. I would like to avoid the repetition in my current solution (see below):
Encoding(posts2$caption_clean) <- "UTF-8"
posts2$caption_clean <- iconv(posts2$caption_clean, "latin1", "UTF-8")
posts2$caption_clean <- gsub("Ã\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("â\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ð\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Â\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("å\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ð\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ñ\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ù\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ø\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ú\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ì\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Õ\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ã\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Û\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ë\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ê\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("追\\S*","",posts2$caption_clean)
Does anyone know how I can simplify this?
Thanks!
Upvotes: 0
Views: 349
Reputation: 455
# put all target characters into one character class []
# and follow the class with \\S* to also remove the rest of the token
# note: groups and \\S* must stay outside [], where they would be literal
regex <- "[Ãâð]\\S*"
string <- "Ã x â x ð y "
gsub(regex, "", string)
result:
[1] " x  x  y "
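To apply the same idea to the full list of characters from the question, the pattern can be built from a vector rather than typed by hand. The following is only a minimal sketch: `strip_mojibake`, its arguments, and the sample `captions` vector are hypothetical names introduced for illustration, and it assumes an R session running in a UTF-8 locale and prefixes that are each a single character with no special meaning inside `[]`.

```r
# Hypothetical helper (not from the original posts): removes every token
# that starts with one of the given single-character prefixes.
strip_mojibake <- function(x, prefixes) {
  # Collapse the prefixes into one character class, then match the
  # remaining non-whitespace characters of the token with \\S*.
  pattern <- paste0("[", paste(prefixes, collapse = ""), "]\\S*")
  gsub(pattern, "", x)
}

# Prefix characters taken from the gsub() calls in the question.
prefixes <- c("Ã", "â", "ð", "Â", "å", "Ð", "Ñ", "Ù", "Ø", "Ú",
              "ì", "Õ", "ã", "Û", "ë", "ê", "追")

# Made-up sample data, for illustration only.
captions <- c("Ã x â x ð y ", "plain text")
strip_mojibake(captions, prefixes)
```

With the data frame from the question, this would presumably be called as `posts2$caption_clean <- strip_mojibake(posts2$caption_clean, prefixes)`. If a prefix were ever longer than one character, an alternation built with `paste(prefixes, collapse = "|")` and wrapped in `( )` would be needed instead of a character class.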
Upvotes: 1