Reputation: 21
In R, I have strings that have encoded junk within, such as
"based on the unique spectral \xfc\xbe\x8e\x93\xa0\xbc\xfc\xbe\x98\xa6\x90\xbc\xfc\xbe\x99\xa6\x8c\xbcfingerprints\xfc\xbe\x8e\x93\xa0\xbc of their biochemical composition"
Is there an easy way to strip the string of the encoded junk, regardless of what the junk is?
Upvotes: 2
Views: 442
Reputation: 47
I have the same problem. I got data from a weather station in .dta format, which is something like .csv with metadata. I do not know the encoding of the file, but in R running in UTF-8 I get the same rubbish as you. I recognised the garbled sequences as Czech characters, which fits the place where the station operates. I used code like this:
gsub(pattern = "\xfc\xbe\x8c\x96\x94\xbc", replacement = "a", x = data, fixed = TRUE)
All the wrongly encoded characters have the same pattern, \xfc\xbe\something\something\something\xbc. The sequence in the code above stands for long a (á).
If you just want to get rid of them, the str_extract function from the stringr package works well for me.
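A minimal base-R sketch of removing every sequence of that shape in one pass, without stringr. It assumes the three middle bytes always fall in the range \x80 to \xbf, which holds for the sequences quoted in this thread but is not guaranteed for other data. The names junk and cleaned are only for illustration. perl = TRUE with useBytes = TRUE makes the match work byte by byte, so the pattern itself can stay plain ASCII:

```r
# Sample string containing one six-byte junk sequence from the question.
x <- "spectral \xfc\xbe\x8e\x93\xa0\xbcfingerprints"

# Assumed shape of the junk: \xfc \xbe, three bytes in \x80-\xbf, then \xbc.
# The doubled backslashes pass the \x escapes through to the regex engine.
junk <- "\\xfc\\xbe[\\x80-\\xbf]{3}\\xbc"

# useBytes = TRUE compares raw bytes, so the input does not have to be
# valid in the current locale.
cleaned <- gsub(junk, "", x, perl = TRUE, useBytes = TRUE)
cleaned
```

To map a particular sequence to a real character instead of dropping it, put that one sequence in the pattern and the character in the replacement.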
Upvotes: 0
Reputation: 174696
Use gsub to remove every character that is not printable:
x <- "based on the unique spectral \xfc\xbe\x8e\x93\xa0\xbc\xfc\xbe\x98\xa6\x90\xbc\xfc\xbe\x99\xa6\x8c\xbcfingerprints\xfc\xbe\x8e\x93\xa0\xbc of their biochemical composition"
gsub("[^[:print:]]", "", x)
# [1] "based on the unique spectral fingerprints of their biochemical composition"
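One thing to be aware of: [:print:] does not include tabs or newlines, so the pattern removes those too. If the strings contain whitespace that should survive, a variant that also whitelists [:space:] may be a better fit. A small sketch with ASCII-only input; the helper names strip_nonprint and strip_keep_space are made up for this example:

```r
# Hypothetical helper names, not part of base R.
strip_nonprint   <- function(s) gsub("[^[:print:]]", "", s)
strip_keep_space <- function(s) gsub("[^[:print:][:space:]]", "", s)

# A string with a tab, a control character (\x01) and a newline.
s <- "col1\tcol2\x01\nrow"

strip_nonprint(s)    # tab, \x01 and newline are all removed
strip_keep_space(s)  # only \x01 is removed
```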
Upvotes: 4