Reputation: 183
I have searched everywhere for an answer but could not find the right one. I need to convert a string to a specific encoding in R, but have not been able to do so:
string <- "überhaupt"
What I need: "&#xFC;berhaupt"
These are the functions I have tried so far:
textutils::HTMLencode(string) gives: "&uuml;berhaupt"
utf8::utf8_print(string, utf8 = F) gives: "\u00fcberhaupt"
iconv(string, from = "windows-1252", "utf-8") gives: "überhaupt"
It seems that I need the hexadecimal numeric character reference (see https://en.wikipedia.org/wiki/%C3%9C), but I don't know how to convert the string to it.
Thanks for your help
Upvotes: 4
Views: 2225
Reputation: 183
I have now found the perfect answer. It should work on all systems:
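library(magrittr)  # %>%
library(stringr)   # str_extract_all(), str_replace_all(), str_detect(), str_c()
library(purrr)     # map()
library(dplyr)     # case_when()

# Function name chosen here for illustration; the original was anonymous.
to_hex_char_ref <- function(x) {
  x %>%
    # split each string into its printable characters
    str_extract_all("[:print:]") %>%
    # escape non-ASCII characters, e.g. ü becomes the six characters \u00fc
    map(~ stringi::stri_escape_unicode(.x)) %>%
    # turn the \u prefix and any leading zeros into "&#x"
    map(~ str_replace_all(.x, "\\\\u0*", "&#x")) %>%
    # close each reference with ";"
    map(~ case_when(str_detect(.x, "&#x") ~ str_c(.x, ";"), TRUE ~ .x)) %>%
    # glue the characters back together
    map(~ str_c(.x, collapse = "")) %>%
    unlist()
}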
Thanks @MrFlick for your help!
Upvotes: -2
Reputation: 206207
So it looks like you want the "numeric character reference" encoding from that page. I'm not sure if there is a built-in function for that, but here is one attempt at writing such a function:
char_ref_encode <- function(x) {
  cp <- charToRaw(x)
  parts <- rle(cp > 127)
  with(parts, {
    starts <- head(cumsum(c(0, lengths)), -1) + 1
    ends <- cumsum(lengths)
    paste0(mapply(function(v, start, end) {
      if (v) {
        paste(sprintf("&#x%02x;", as.numeric(cp[start:end])), collapse = "")
      } else {
        intToUtf8(cp[start:end])
      }
    }, values, starts, ends), collapse = "")
  })
}
char_ref_encode("überhaupt")
# [1] "&#xfc;berhaupt"
The basic idea is to look for all the non-ASCII characters and then encode them with their hex values.
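One thing to keep in mind: charToRaw() returns the bytes of the string in whatever encoding it is stored in, so in a UTF-8 session ü is two bytes (c3 bc) and the byte-wise approach would emit two references instead of one. Below is a minimal sketch of the same idea that works on Unicode code points instead, using base R's utf8ToInt(). The name char_ref_encode_cp is made up for this illustration and is not part of any package; I have only checked it against the cases shown.

```r
# Hypothetical helper (not from any package): encode every non-ASCII
# character as a hexadecimal numeric character reference, by code point.
char_ref_encode_cp <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    # Convert to UTF-8 first so the result does not depend on the
    # session's native encoding, then get integer code points.
    cp <- utf8ToInt(enc2utf8(s))
    # One single-character string per code point (used for the ASCII ones).
    chars <- intToUtf8(cp, multiple = TRUE)
    # Non-ASCII code points become "&#x<HEX>;", ASCII stays as it is.
    out <- ifelse(cp > 127L, sprintf("&#x%X;", cp), chars)
    paste(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# "\u00fc" is ü, written as an escape so the example does not depend on
# how this file is encoded.
char_ref_encode_cp("\u00fcberhaupt")
# [1] "&#xFC;berhaupt"
```

Because it goes through code points, characters outside Latin-1 get a single reference as well, e.g. the euro sign "\u20ac" becomes "&#x20AC;".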
Upvotes: 5