Svilen

Reputation: 35

Decoding Cyrillic string in R

I would like to decode this string in R: РѕР±РµР·РїРµС‡РµРЅ. The desired output is: обезпечен

This site suggests that the source encoding is UTF-8 and that it should be transcoded to Windows-1251. So I tried this, with no success:

> word <- "РѕР±РµР·РїРµС‡РµРЅ"
> iconv(word, from = "UTF-8", to = "Windows-1251")
[1] "РѕР±РµР·РїРµС‡РµРЅ"

Upvotes: 1

Views: 520

Answers (1)

MrFlick

Reputation: 206242

These steps seem to do the trick

word <- "РѕР±РµР·РїРµС‡РµРЅ"

xx <- iconv(word, from="UTF-8", to="cp1251")
Encoding(xx) <- "UTF-8"
xx
# [1] "обезпечен"

target <- "обезпечен"
xx == target
# [1] TRUE
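
If you need this for a whole column rather than one string, the two steps can be wrapped in a small helper. This is only a sketch: the name fix_cp1251_mojibake is made up for illustration (it is not part of base R), and it assumes every element was damaged in the same UTF-8-read-as-cp1251 way. iconv() returns NA for elements it cannot convert, so check for new NAs afterwards. The garbled input is written with \u escapes so the example does not depend on the encoding of the script file.

```r
# Hypothetical helper: undo "UTF-8 bytes read as cp1251" damage.
fix_cp1251_mojibake <- function(x) {
  # Encoding the garbled text as cp1251 recovers the original bytes ...
  out <- iconv(x, from = "UTF-8", to = "cp1251")
  # ... which are really UTF-8, so declare them as such.
  Encoding(out) <- "UTF-8"
  out
}

# The garbled word from the question, spelled out with \u escapes.
garbled <- paste0(
  "\u0420\u0455", "\u0420\u00b1", "\u0420\u00b5", "\u0420\u00b7",
  "\u0420\u0457", "\u0420\u00b5", "\u0421\u2021", "\u0420\u00b5",
  "\u0420\u0405"
)

# Pure ASCII strings and NA pass through unchanged.
repaired <- fix_cp1251_mojibake(c(garbled, "plain ASCII", NA))
print(repaired)
```

Note that a string which was never damaged and contains non-ASCII characters should not be passed through this helper, since the conversion would then corrupt it or yield NA.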

So it seems that at some point the bytes that make up the UTF-8 target value were misinterpreted as being cp1251 encoded, and somewhere a process then converted those bytes to UTF-8 using the cp1251-to-UTF-8 mapping rules. When you run that conversion on data that isn't really cp1251 encoded, you get weird values:

iconv(target, from="cp1251", to="UTF-8")
# [1] "РѕР±РµР·РїРµС‡РµРЅ"
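
To make the byte-level story visible, here is a rough sketch that follows the raw bytes through the damage and the repair using charToRaw(). The variable names are just for illustration, and the word is written with \u escapes so the result does not depend on the script or console encoding.

```r
# The correctly encoded word ("обезпечен"), written with \u escapes.
target <- "\u043e\u0431\u0435\u0437\u043f\u0435\u0447\u0435\u043d"

# Its UTF-8 bytes: two bytes per Cyrillic letter.
target_bytes <- charToRaw(enc2utf8(target))
print(target_bytes)
# d0 be d0 b1 d0 b5 d0 b7 d0 bf d0 b5 d1 87 d0 b5 d0 bd

# Damage: treat those bytes as cp1251 and convert them to UTF-8.
# Each single byte becomes its own character.
garbled <- iconv(target, from = "cp1251", to = "UTF-8")

# Repair: encode the garbled text back to cp1251, which restores the
# original byte sequence, then mark the result as UTF-8.
fixed <- iconv(garbled, from = "UTF-8", to = "cp1251")
repaired_bytes <- charToRaw(fixed)
Encoding(fixed) <- "UTF-8"

print(identical(repaired_bytes, target_bytes))
print(fixed == target)
```

The repair works because the cp1251 round trip is lossless for these particular bytes; the only thing that was ever wrong was the label attached to them.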

Upvotes: 2
