GC9000

Reputation: 21

R: Remove all encoded text from a string

In R, I have strings that have encoded junk within, such as

"based on the unique spectral \xfc\xbe\x8e\x93\xa0\xbc\xfc\xbe\x98\xa6\x90\xbc\xfc\xbe\x99\xa6\x8c\xbcfingerprints\xfc\xbe\x8e\x93\xa0\xbc of their biochemical composition"

Is there an easy way to strip the string of the encoded junk, regardless of what the junk is?

Upvotes: 2

Views: 442

Answers (2)

Gray Jackal

Reputation: 47

I have the same problem. I got data from a weather station in .dta format, which is something like .csv with metadata. I do not know the encoding of the file, but in R running in UTF-8 I get the same rubbish as you. I identified the garbled characters as Czech letters, Czech being the language of the place where the station operates. I used code like this:

gsub(pattern = "\xfc\xbe\x8c\x96\x94\xbc", replacement = "a", x = data, fixed = TRUE, useBytes = TRUE)

All the wrongly encoded characters share the same pattern: \xfc\xbe, then three varying bytes, then \xbc. In the code above, the sequence being replaced is the one that stands for long a (á).
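If every junk sequence really does follow that shape, all of them can be removed in one pass instead of one gsub call per character. A minimal sketch, assuming the six-byte pattern holds for your data; the name strip_junk is made up for illustration, and matching byte by byte with perl = TRUE and useBytes = TRUE is an assumption on my side, not something taken from the code above:

```r
# Hypothetical helper: drop every six-byte sequence of the form
# \xfc \xbe <any 3 bytes> \xbc, matching byte by byte.
strip_junk <- function(s) {
  # The doubled backslashes hand the \x escapes to the regex engine,
  # so the pattern itself stays plain ASCII.
  gsub("\\xfc\\xbe.{3}\\xbc", "", s, perl = TRUE, useBytes = TRUE)
}

x <- "based on the unique spectral \xfc\xbe\x8e\x93\xa0\xbc\xfc\xbe\x98\xa6\x90\xbc\xfc\xbe\x99\xa6\x8c\xbcfingerprints\xfc\xbe\x8e\x93\xa0\xbc of their biochemical composition"
clean <- strip_junk(x)
print(clean)
```

Unlike a blanket "remove everything non-ASCII" approach, this leaves any correctly encoded accented characters elsewhere in the string alone.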

If you just want to get rid of them, the str_extract function from the stringr package works well for me.

Upvotes: 0

Avinash Raj

Reputation: 174696

Use gsub to remove every character that is not printable:

x <- "based on the unique spectral \xfc\xbe\x8e\x93\xa0\xbc\xfc\xbe\x98\xa6\x90\xbc\xfc\xbe\x99\xa6\x8c\xbcfingerprints\xfc\xbe\x8e\x93\xa0\xbc of their biochemical composition"
gsub("[^[:print:]]", "", x)
# [1] "based on the unique spectral fingerprints of their biochemical composition"
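An alternative that avoids regular expressions is iconv with sub = "", which drops every byte it cannot convert. A minimal sketch, assuming you are fine with losing all non-ASCII characters (accented letters included); reading the input as latin1 is only a trick so that every byte is accepted on the way in, not a claim about the real encoding:

```r
x <- "based on the unique spectral \xfc\xbe\x8e\x93\xa0\xbc\xfc\xbe\x98\xa6\x90\xbc\xfc\xbe\x99\xa6\x8c\xbcfingerprints\xfc\xbe\x8e\x93\xa0\xbc of their biochemical composition"

# Treat the bytes as latin1, convert to ASCII, and replace anything
# without an ASCII equivalent by the empty string.
clean_ascii <- iconv(x, from = "latin1", to = "ASCII", sub = "")
print(clean_ascii)
```

This is a blunter tool than the gsub call: it is only a good fit when the text you want to keep is plain ASCII.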

Upvotes: 4
