Reputation: 55
I've been looking at this bug for a couple of days, and it looks like the htmlParse function has an encoding problem when parsing Russian characters.
For example:
htmlParse("http://ru.wikipedia.org/wiki/Russia", encoding="UTF-8")
This page is in UTF-8, but to be sure, I'm forcing htmlParse to treat it as UTF-8.
But in the htmlParse() output, English characters are displayed correctly, while Russian ones come out as typical wrongly decoded characters.
I'm using Windows 8 and my locale is Russian_Russia.1251. I think the non-Unicode locale is the problem here, because when I run the same command on Ubuntu, everything works as expected, but Ubuntu has the en_EN.UTF-8 locale.
Upvotes: 2
Views: 8064
Reputation: 121568
I don't know what you have tried, but this works fine for me:
doc <- htmlParse("http://ru.wikipedia.org/wiki/Russia", encoding="UTF-8")
xpathSApply(doc,'//*[@id="mw-content-text"]/ul/li/a',xmlValue)
[1] "Russia (фильм)" "Киры Муратовой" "Наша Russia"
"Руша (Огайо)" "англ." "Россия (значения)"
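If the output is still garbled on a non-UTF-8 Windows locale, one thing that may be worth checking is whether the returned strings actually hold UTF-8 bytes but are not marked as UTF-8. This is only a guess at the cause, not something confirmed for this page. The sketch below uses base R only and a hand-built byte sequence (the variable name `s` is just for illustration) to show what declaring the encoding does:

```r
# UTF-8 bytes for the word "Россия", built from raw values so the example
# does not depend on the encoding of the script file itself.
s <- rawToChar(as.raw(c(0xD0, 0xA0, 0xD0, 0xBE, 0xD1, 0x81,
                        0xD1, 0x81, 0xD0, 0xB8, 0xD1, 0x8F)))

# rawToChar() does not mark the string, so R assumes the native encoding.
# In a CP1251 locale these bytes would then print as garbage.
Encoding(s)

# Declare that the bytes are UTF-8. The bytes themselves are not changed.
Encoding(s) <- "UTF-8"
Encoding(s)
nchar(s)   # counted as characters of the declared encoding
```

Assuming the same symptom applies, the character vector returned by xpathSApply(doc, ..., xmlValue) could be marked the same way with Encoding(x) <- "UTF-8" before printing.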
Upvotes: 1