Reputation: 6613
I need to print the RSS link from a web page, but the link is decoded incorrectly. Here is my code:
import urllib2
from lxml import html, etree
import chardet
data = urllib2.urlopen('http://facts-and-joy.ru/')
S=data.read()
encoding = chardet.detect(S)['encoding']
#S=S.decode(encoding)
#encoding='utf-8'
print encoding
parser = html.HTMLParser(encoding=encoding)
content = html.document_fromstring(S,parser)
loLinks = content.xpath('//link[@type="application/rss+xml"]')
for oLink in loLinks:
    print oLink.xpath('@title')[0]
    print etree.tostring(oLink,encoding='utf-8')
Here is my output:
utf-8
Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/" />
The title is displayed correctly when printed by itself, but in the tostring() output it is replaced by strange &#... sequences. How can I print the whole link element correctly?
Thanks in advance for your help!
Upvotes: 2
Views: 963
Reputation: 50947
Here is a simplified version of your program that works:
from lxml import html
url = 'http://facts-and-joy.ru/'
content = html.parse(url)
rsslinks = content.xpath('//link[@type="application/rss+xml"]')
for link in rsslinks:
    print link.get('title')
    print html.tostring(link, encoding="utf-8")
Output:
Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">
The crucial line is
print html.tostring(link, encoding="utf-8")
That is the only thing you must change in your original program.
Using html.tostring() instead of etree.tostring() produces actual characters instead of numeric character references. You could also use etree.tostring(link, method="html", encoding="utf-8").
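For anyone who wants to try the two HTML-method calls without hitting the network, here is a minimal sketch. It assumes Python 3 and an installed lxml, and the inline `page` string is a hypothetical stand-in for the real site. It also passes encoding="unicode" rather than "utf-8" so that the result is a str that print() handles directly.

```python
# Python 3 sketch; requires the third-party lxml package.
from lxml import etree, html

# Hypothetical inline page standing in for the real site (no network needed).
page = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" '
    'title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">'
    '</head><body></body></html>'
)

root = html.fromstring(page)
link = root.xpath('//link[@type="application/rss+xml"]')[0]

# Both calls use the HTML output method. encoding="unicode" returns a str
# instead of UTF-8 bytes.
via_html = html.tostring(link, encoding="unicode")
via_etree = etree.tostring(link, method="html", encoding="unicode")

print(via_html)
print(via_etree)
```

In both results the title attribute should contain the Cyrillic text itself rather than &#... references.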
It is not clear why this difference exists between the "html" and "xml" output methods. This post to the lxml mailing list didn't get any replies: https://mailman-mail5.webfaction.com/pipermail/lxml/2011-September/006131.html.
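Whatever the reason, the two forms represent the same text: a numeric character reference is just an ASCII-only spelling of a character's code point. The sketch below illustrates this with the Python 3 standard library only. It does not use lxml, and `unescape` here comes from the stdlib html module, not from lxml.html.

```python
# Python 3, standard library only (this is the stdlib html module, not lxml.html).
from html import unescape

title = "Позитивное мышление RSS Feed"

# The "xmlcharrefreplace" error handler replaces every character that ASCII
# cannot represent with a decimal numeric character reference.
escaped = title.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)  # begins with &#1055;&#1086; (the code points of "П" and "о")

# Unescaping the references gives back exactly the original string.
restored = unescape(escaped)
print(restored == title)  # True
```

So a browser or feed reader would treat the escaped attribute as equivalent to the readable one; the difference only matters when you want readable output.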
Upvotes: 2