Reputation: 131038
I have Python code that reads RSS feeds written in Cyrillic (Russian, for example). This is the code I use:
import feedparser
from urllib2 import Request, urlopen

d = feedparser.parse(source_url)
# Make a loop over the entries of the RSS feed.
for e in d.entries:
    # Get the title of the news.
    title = e.title
    title = title.replace(' ', '%20')
    title = title.encode('utf-8')
    # Get the URL of the entry.
    url = e.link
    url = url.encode('utf-8')
    # Make the request.
    address = 'http://example.org/save_link.php?title=' + title + '&source=' + source_name + '&url=' + url
    # Submit the link.
    req = Request(address)
    f = urlopen(req)
I use encode('utf-8') since the titles are given in Cyrillic letters, and it works fine. An example of the RSS source is here. The problem appears when I try to read the list of RSS sources from another URL. In more detail, there is a web page that contains a list of RSS sources (the URLs of the sources as well as their names, given in Cyrillic letters). An example of the list is here:
<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN' 'http://www.w3.org/TR/html4/loose.dtd'>
<html>
<head>
<title></title>
<meta http-equiv='Content-Type' content='text/html;charset=utf-8'>
</head>
<body>
ua, Корреспондент, http://k.img.com.ua/rss/ua/news.xml
ua, Українська Правда, http://www.pravda.com.ua/rss/
</body>
</html>
The problem appears when I try to apply encode('utf-8') to the Cyrillic names given in this document: I get a UnicodeDecodeError. Does anybody know why?
Upvotes: 2
Views: 241
Reputation: 157314
encode will only raise UnicodeDecodeError if you call it on a str object, which Python 2 then first tries to decode to unicode; see http://wiki.python.org/moin/UnicodeDecodeError.
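To make that hidden step visible, here is a minimal sketch that performs the decode explicitly. It assumes the default codec is ASCII (the usual Python 2 setting); raw_name and implicit_decode_fails are hypothetical names used only for illustration.

```python
# -*- coding: utf-8 -*-

# UTF-8 bytes, standing in for a name as read from the list page.
raw_name = u'Корреспондент'.encode('utf-8')


def implicit_decode_fails(raw):
    """Mimic the decode step that Python 2 runs before str.encode().

    Assumption: the default codec is ASCII, as it normally is in Python 2.
    """
    try:
        raw.decode('ascii')
    except UnicodeDecodeError:
        return True
    return False


# Cyrillic bytes are outside the ASCII range, so the decode step fails.
decode_failed = implicit_decode_fails(raw_name)
```

Plain ASCII bytes pass through that step without error, which is why the problem only shows up with non-ASCII names.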
You need to decode the str object to unicode first:
name = name.decode('utf-8')
This will take a str in UTF-8 encoding and give you a unicode object.
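Applied to one line of your list, that could look like the sketch below. It assumes each entry has the form "language, name, url" and that the page bytes are UTF-8, as the meta tag declares; raw_line and the field names are hypothetical.

```python
# -*- coding: utf-8 -*-

# One line of the list as raw UTF-8 bytes, standing in for what reading
# the page returns.
raw_line = u'ua, Українська Правда, http://www.pravda.com.ua/rss/'.encode('utf-8')

# Decode once, right after reading, then work with text from here on.
line = raw_line.decode('utf-8')

# Split into at most three fields and trim the surrounding spaces.
language, source_name, source_url = [part.strip() for part in line.split(',', 2)]

# Encoding now works, because source_name is text rather than raw bytes.
encoded_name = source_name.encode('utf-8')
```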
It works for the code that you posted because feedparser returns feed data already decoded to unicode.
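If the same code has to handle both kinds of input, one option is a small helper that decodes only when it is handed raw bytes. This is a sketch, not part of feedparser; to_utf8_bytes is a hypothetical name, and it relies on bytes being an alias for str in Python 2.

```python
# -*- coding: utf-8 -*-


def to_utf8_bytes(value, encoding='utf-8'):
    """Return value as UTF-8 bytes, whether it arrives as bytes or as text.

    `encoding` is the assumed encoding of any raw bytes passed in.
    """
    if isinstance(value, bytes):
        # Raw bytes (a Python 2 str): decode to text before encoding.
        value = value.decode(encoding)
    return value.encode('utf-8')


# Raw bytes, standing in for a name read from the list page.
raw_name = u'Українська Правда'.encode('utf-8')
# Text, standing in for a title as feedparser returns it.
text_name = u'Українська Правда'

from_raw = to_utf8_bytes(raw_name)
from_text = to_utf8_bytes(text_name)
```

Both calls produce the same UTF-8 bytes, so the rest of the request-building code does not need to know where the name came from.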
Upvotes: 6