Rishi

Reputation: 55

How to remove or convert Latin-1 encoded characters in R?

Below is the input string from which I need to remove the Latin-1 encoded characters, e.g. '\xf0'.

str <- "b'RT @galacticemp: transboundary water idea jam \\xf0\\x9f\\x92\\xa1 \\xf0\\x9f\\x92\\xa6 with @dstgovza @GlobalDevLab #ibmresearchwits @WitsUniversity @IBMResearch @USEmbassySA @\\xe2\\x80\\xa6'"

iconv(str, "latin1", "ASCII", sub="")
I tried several approaches, including the call above, but could not remove or convert them.

I noticed that the code works when the string contains single backslashes, but fails when it contains double backslashes. Any workaround or regex pattern (for the gsub function) that removes them would help me a lot. Thanks.

Upvotes: 0

Views: 735

Answers (1)

G5W

Reputation: 37661

If you just want to remove them, you can do that with gsub:

gsub("[\x80-\xff]", "", str)
[1] "b'RT @galacticemp: transboundary water idea jam   with @dstgovza @GlobalDevLab #ibmresearchwits @WitsUniversity @IBMResearch @USEmbassySA @'"

Just to be clear, this replaces every character numbered 128-255 (that is, everything outside the ASCII range) with an empty string.
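When the string holds real non-ASCII characters (rather than escape text), iconv can perform the same removal, provided the source encoding is named correctly. The following is a minimal sketch, assuming the input is UTF-8; the variable names and the sample text are made up for illustration, and iconv behaviour can vary somewhat between platforms.

```r
# Hypothetical sample containing a real emoji (U+1F4A1), stored as UTF-8.
tweet <- "idea jam \U0001F4A1 with friends"

# sub = "" replaces every byte that cannot be converted to ASCII
# with an empty string, which effectively deletes it.
ascii_only <- iconv(tweet, from = "UTF-8", to = "ASCII", sub = "")

print(ascii_only)
```

Note that the spaces on either side of the removed character remain, so a follow-up such as gsub(" +", " ", ascii_only) may be wanted.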

Edit: Based on the updated information from the OP, I now think that the strings do not contain Unicode characters, but rather escape codes for Unicode characters (literal text such as a backslash followed by xf0). Those can be removed in a similar way, but the pattern now has to describe the escape codes themselves.

str <- "b'RT @galacticemp: transboundary water idea jam \\xf0\\x9f\\x92\\xa1 \\xf0\\x9f\\x92\\xa6 with @dstgovza @GlobalDevLab #ibmresearchwits @WitsUniversity @IBMResearch @USEmbassySA @\\xe2\\x80\\xa6'"
gsub("\\\\x[89a-f][0-9a-f]", "", str)
[1] "b'RT @galacticemp: transboundary water idea jam   with @dstgovza @GlobalDevLab #ibmresearchwits @WitsUniversity @IBMResearch @USEmbassySA @'"
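If the goal is to convert the escape codes rather than remove them, one possible approach is to turn each run of escapes back into bytes and mark the result as UTF-8. This is only a sketch: decode_hex_escapes is a hypothetical helper (not part of base R), it assumes the escapes spell out valid UTF-8, and it would fail on an escaped zero byte because rawToChar does not accept embedded nuls.

```r
# Hypothetical helper: replace runs of literal "\xNN" escapes with the
# characters those bytes encode, assuming the bytes are valid UTF-8.
decode_hex_escapes <- function(s) {
  # One match per run of consecutive escapes, e.g. \xf0\x9f\x92\xa1
  m <- gregexpr("(\\\\x[0-9a-fA-F]{2})+", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(runs) {
    vapply(runs, function(run) {
      # Drop the leading \x, then split on the remaining \x separators.
      hex <- strsplit(substring(run, 3), "\\x", fixed = TRUE)[[1]]
      out <- rawToChar(as.raw(strtoi(hex, 16L)))
      Encoding(out) <- "UTF-8"  # declare the bytes as UTF-8
      out
    }, character(1), USE.NAMES = FALSE)
  })
  s
}

# Sample input with one escaped character (an ellipsis, U+2026).
escaped <- "see more @\\xe2\\x80\\xa6"
decoded <- decode_hex_escapes(escaped)
print(decoded)
```

Working on whole runs rather than single escapes matters here, because one character is spread over several bytes and cannot be decoded one escape at a time.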

Upvotes: 1
