Reputation: 55
Below is the input string, from which I need to remove the Latin-1 encoded characters, e.g. '\xf0'.
str <- "b'RT @galacticemp: transboundary water idea jam \\xf0\\x9f\\x92\\xa1 \\xf0\\x9f\\x92\\xa6 with @dstgovza @GlobalDevLab #ibmresearchwits @WitsUniversity @IBMResearch @USEmbassySA @\\xe2\\x80\\xa6'"
iconv(str, "latin1", "ASCII", sub="")
I tried many approaches but could not remove or convert them.
Here I noticed that the code works if the string has a single backslash, but fails with a double backslash. Any workaround or regex pattern (for the gsub function) to remove them would help me a lot. Thanks.
Upvotes: 0
Views: 735
Reputation: 37661
If you just want to remove them, you can do that with gsub:
gsub("[\x80-\xff]", "", str)
[1] "b'RT @galacticemp: transboundary water idea jam with @dstgovza @GlobalDevLab #ibmresearchwits @WitsUniversity @IBMResearch @USEmbassySA @'"
Just to be clear, what this is doing is replacing any character with a code in the range 128-255 (hex 80-ff) with an empty string.
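As a rough illustration of the same idea, here is a minimal sketch on a made-up string (the variable names `x` and `ascii_only` are hypothetical, not from the question). It assumes the input really contains non-ASCII characters encoded as UTF-8, and it swaps in `iconv()` with `sub = ""` rather than the `gsub` call above, so treat it as an alternative to try, not as the method used here.

```r
# Hypothetical example string containing two real non-ASCII characters
x <- "caf\u00e9 na\u00efve"

# Code point of the first accented letter; it falls in the 128-255 range
utf8ToInt("\u00e9")   # 233

# Convert to ASCII and drop every byte that cannot be represented
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Every remaining character should now have a code below 128
all(utf8ToInt(ascii_only) < 128)
```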
Edit: Based on the updated information from the OP, I now think that the strings do not contain Unicode characters, but rather escape codes for Unicode characters. Those can be removed in a similar way, but now you have to specify a pattern that describes those escape codes.
str <- "b'RT @galacticemp: transboundary water idea jam \\xf0\\x9f\\x92\\xa1 \\xf0\\x9f\\x92\\xa6 with @dstgovza @GlobalDevLab #ibmresearchwits @WitsUniversity @IBMResearch @USEmbassySA @\\xe2\\x80\\xa6'"
gsub("\\\\x[89a-f][0-9a-f]", "", str)
[1] "b'RT @galacticemp: transboundary water idea jam   with @dstgovza @GlobalDevLab #ibmresearchwits @WitsUniversity @IBMResearch @USEmbassySA @'"
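To make the single versus double backslash difference concrete, here is a small sketch using a shortened copy of the string from the question (the names `s` and `cleaned` are illustrative only). It shows that the escaped form is plain ASCII text, which would explain why `iconv` finds nothing to strip, and it adds one optional step that is not part of the answer above: collapsing the spaces left behind after the escape codes are removed.

```r
# Shortened copy of the question's string; "\\x" is a literal backslash plus "x"
s <- "idea jam \\xf0\\x9f\\x92\\xa1 \\xf0\\x9f\\x92\\xa6 with @dstgovza @\\xe2\\x80\\xa6"

# A real byte escape is one byte; the escaped text is four ordinary characters
nchar("\xf0", type = "bytes")   # 1
nchar("\\xf0")                  # 4

# The whole string is already ASCII, so an ASCII conversion changes nothing
all(utf8ToInt(s) < 128)         # TRUE

# Remove the textual escape codes with the pattern from the answer
cleaned <- gsub("\\\\x[89a-f][0-9a-f]", "", s)

# Optional extra step (an addition, not from the answer): squeeze repeated spaces
cleaned <- gsub(" +", " ", cleaned)
cleaned                         # "idea jam with @dstgovza @"
```

Note that the pattern only matches lowercase hex digits, as in the question's data; if the escapes could appear in uppercase, passing `ignore.case = TRUE` to `gsub` is one way to cover that.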
Upvotes: 1