Vadim Smakhtin

Reputation: 55

R - htmlParse() from XML package can't understand Russian letters

I've been looking at this bug for a couple of days, and it looks like the htmlParse function has an encoding problem when parsing Russian characters.

For example:

htmlParse("http://ru.wikipedia.org/wiki/Russia", encoding="UTF-8")

This page is encoded in UTF-8, but to be sure, I'm forcing htmlParse to read it as UTF-8.

But in the htmlParse() output, English characters are encoded correctly, while Russian characters come out as typical mis-encoded garbage.

I'm using Windows 8 and my locale is Russian_Russia.1251. I think the non-Unicode locale is the problem here, because when I run the same command on Ubuntu, everything works as expected, but Ubuntu has the en_EN.UTF-8 locale.

Upvotes: 2

Views: 8064

Answers (1)

agstudy

Reputation: 121568

I don't know what you have tried, but this works fine for me:

library(XML)
doc <- htmlParse("http://ru.wikipedia.org/wiki/Russia", encoding="UTF-8")
xpathSApply(doc, '//*[@id="mw-content-text"]/ul/li/a', xmlValue)
[1] "Russia (фильм)"    "Киры Муратовой"    "Наша Russia"
[4] "Руша (Огайо)"      "англ."             "Россия (значения)"
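
If the output is garbled only under a non-UTF-8 locale such as Russian_Russia.1251, one possible cause is that the returned strings hold UTF-8 bytes but are not marked as UTF-8, so R interprets them in the native encoding. The sketch below uses base R only and reproduces that situation in isolation. The variable names are made up for illustration, and the iconv name "CP1251" is an assumption that may not be available on every platform.

```r
# UTF-8 bytes of the Russian word for "Russia", written as raw bytes so the
# example does not depend on how this file itself is encoded.
utf8_bytes <- as.raw(c(0xd0, 0xa0, 0xd0, 0xbe, 0xd1, 0x81,
                       0xd1, 0x81, 0xd0, 0xb8, 0xd1, 0x8f))

# rawToChar() returns a string with no declared encoding, so R treats the
# bytes as if they were in the native locale encoding.
x <- rawToChar(utf8_bytes)

# Simulate the misreading: interpret the same bytes as Windows-1251.
# ("CP1251" is an assumed iconv name; iconv() returns NA if it is unsupported.)
garbled <- iconv(x, from = "CP1251", to = "UTF-8")

# Declare the bytes as UTF-8. No bytes change, only the encoding marker.
Encoding(x) <- "UTF-8"

# The string now decodes to the six Cyrillic code points of the word.
code_points <- utf8ToInt(x)
```

If this is indeed the cause, applying Encoding(vals) <- "UTF-8" to the character vector returned by xpathSApply(..., xmlValue) may restore the Russian text on Windows. This has not been verified on a Russian_Russia.1251 machine.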

Upvotes: 1
