Apogentus

Reputation: 6613

HTML elements in lxml get incorrectly encoded like &#x41D;&#x430;&#x439;

I need to print the RSS link from a web page, but the link is encoded incorrectly. Here is my code:

import urllib2
from lxml import html, etree
import chardet

data = urllib2.urlopen('http://facts-and-joy.ru/')
S=data.read()
encoding = chardet.detect(S)['encoding']
#S=S.decode(encoding)
#encoding='utf-8'

print encoding
parser = html.HTMLParser(encoding=encoding)
content = html.document_fromstring(S,parser)
loLinks = content.xpath('//link[@type="application/rss+xml"]')

for oLink in loLinks:
    print oLink.xpath('@title')[0]
    print etree.tostring(oLink,encoding='utf-8')

Here is my output:

utf-8
Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="&#x41F;&#x43E;&#x437;&#x438;&#x442;&#x438;&#x432;&#x43D;&#x43E;&#x435; &#x43C;&#x44B;&#x448;&#x43B;&#x435;&#x43D;&#x438;&#x435; RSS Feed" href="http://facts-and-joy.ru/feed/" />&#13;

The title is displayed correctly when printed by itself, but in the output of tostring() it is replaced by strange &#... sequences. How can I print the whole link element correctly?

Thanks in advance for your help!

Upvotes: 2

Views: 963

Answers (1)

mzjn

Reputation: 50947

Here is a simplified version of your program that works:

from lxml import html

url = 'http://facts-and-joy.ru/'
content = html.parse(url)
rsslinks = content.xpath('//link[@type="application/rss+xml"]')

for link in rsslinks:
    print link.get('title')
    print html.tostring(link, encoding="utf-8")

Output:

Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">&#13;

The crucial line is

print html.tostring(link, encoding="utf-8")

That is the only thing you must change in your original program.

Using html.tostring() instead of etree.tostring() produces actual characters instead of numeric character references. You could also use etree.tostring(link, method="html", encoding="utf-8").
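To try the two equivalent calls without fetching the page, here is a minimal sketch written for Python 3. The PAGE string is a hypothetical stand-in for the real document, and REFERENCED is copied from the output in the question; neither is part of the original program.

```python
from html import unescape  # standard library, used only to decode references

from lxml import etree, html

# Hypothetical inline document standing in for http://facts-and-joy.ru/
PAGE = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" '
    'title="Позитивное мышление RSS Feed" '
    'href="http://facts-and-joy.ru/feed/">'
    '</head><body></body></html>'
)

doc = html.document_fromstring(PAGE)
link = doc.xpath('//link[@type="application/rss+xml"]')[0]

# Both calls use the "html" output method; tostring() returns bytes
# when a byte encoding such as "utf-8" is requested.
as_html = html.tostring(link, encoding="utf-8").decode("utf-8")
as_etree_html = etree.tostring(
    link, method="html", encoding="utf-8"
).decode("utf-8")

print(as_html)
print(as_etree_html)

# The numeric character references from the question are the same text
# in escaped form; unescape() turns them back into characters.
REFERENCED = "&#x41F;&#x43E;&#x437;&#x438;&#x442;&#x438;&#x432;&#x43D;&#x43E;&#x435;"
print(unescape(REFERENCED))
```

Both serialized strings should contain the Cyrillic title as literal characters. The escaped form in the question is still valid markup, so a browser or parser reads it as the same title; only the printed form differs.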

It is not clear why the "html" and "xml" output methods differ in this way. A post to the lxml mailing list about it didn't get any replies: https://mailman-mail5.webfaction.com/pipermail/lxml/2011-September/006131.html.

Upvotes: 2
