How to match all internationalized text?

Question

I'm on a search-and-destroy mission for anything Amazon finds distasteful. In the past I've dealt with this by using iconv to convert from "UTF-8" to "latin1", but I can't do that here because the string's encoding is reported as "unknown":

test <- "Gwena\xeblle M"
> gsub("\xeb","", df[306,"primauthfirstname"] )
[1] "Gwenalle M"
> Encoding(df[306,"primauthfirstname"])
[1] "unknown"

So what regex eliminates all the \x## codes?

Josh O'Brien · Accepted Answer

I believe this pattern should work:

pat <- "[\x80-\xFF]"

test <- c("Gwena\xeblle M", "\x92","\xe4","\xe1","\xeb") 
gsub(pat, "", test, perl=TRUE)
# [1] "Gwenalle M" ""           ""           ""           ""

Explanation:

It works because the character class "[\x00-\xFF]" would match every single-byte character, i.e. everything that can be written in the form \x##. The first half of that range, bytes 0 to 127 (00 to 7F in hex), is the ASCII characters. So it's the second half, bytes 128 to 255 (80 to FF in hex), that you want to search out and destroy.
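
If the regex misbehaves in your locale (some setups complain about invalid input strings), the same idea can be sketched at the byte level with no regex at all. This is only an illustration of the "keep 00 to 7F, drop 80 to FF" reasoning above; `strip_high_bytes` is a made-up helper name, not something from base R or from your data.

```r
# Hypothetical helper (name is an assumption): drop every byte >= 0x80.
strip_high_bytes <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)   # keep missing values missing
    b <- charToRaw(s)                      # the string as raw bytes
    rawToChar(b[as.integer(b) < 128L])     # keep only ASCII bytes (00-7F)
  }, character(1), USE.NAMES = FALSE)
}

test <- c("Gwena\xeblle M", "\x92", "\xe4", "\xe1", "\xeb")
strip_high_bytes(test)
# [1] "Gwenalle M" ""           ""           ""           ""
```

Note that, like the byte-range pattern, this removes all bytes of a multi-byte UTF-8 character, so it deletes accented letters rather than transliterating them.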
