Reputation: 4152
I am building an EPG scraper that creates a UTF-8 encoded XML file. All is well, except that I am having trouble encoding the pieces of string I am stitching together into a unicode string that I can write to my file.
My code is as follows:
starttime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[0].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')
endtime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[1].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')
global epg_data
clean_channel = str(channel.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e2 = str(e[2].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e3 = str(e[3].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
div_list3 = div_list2.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;')
e5 = str(e[5].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
epg_data = ''.join([u'<programme start="',starttime,u' +0100" stop="',endtime,u' +0100" channel="',clean_channel,u'">\n', \
u'<title lang="eng">',e5,u'</title>\n<desc lang="eng">',clean_e2,' ',clean_e3,u'</desc>\n<icon src="',div_list3,u'" />\n', \
u'<country>UK</country>\n</programme>'])
I am hitting a problem when trying to parse the following (as printed to IDLE):
<programme start="20180514180500 +0100" stop="20180514190000 +0100" channel="BBC Entertainment">
<title lang="eng">Hustle</title>
<desc lang="eng">Hustle Tiger Troubles Season 6 Episode 3/6When a notorious hardman demands £500,000 from Albert by the end of the week, the team tries to raise the cash by targeting a playboy in possession of a gold tiger worth a vast amount of money. Emma is sent to persuade the owner to lend the item to a major museum, in the hope the gang can steal it, but an impenetrable vault causes complications. Guest starring former Doctor Who star Colin Baker and Lolita Chakrabarti : 8.2</desc>
<icon src="http://my.tvguide.co.uk/channel_logos/60x35/68.png" />
<country>UK</country>
</programme>
The generated error is as follows:
Traceback (most recent call last):
File "G:\Python27\Kodi\Sky TV Guide Scraper.py", line 332, in soup_to_text
u'<country>UK</country>\n</programme>'])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 75: ordinal not in range(128)
I've sort of lost my way a bit with sorting this out, so any help would be gratefully received.
Thanks
Upvotes: 0
Views: 230
Reputation: 77407
Unicode support is rather confusing in Python 2. That's in the top 50 reasons to move to Python 3. Encoding a str or unicode to utf-8 returns a str object which is indistinguishable from a regular ASCII string. You just have to remember that it's encoded. str(channel.encode('utf-8')) is a bit redundant (it's already a str, so the str(..) part isn't necessary).
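When you called ''.join([u'<programme start="', etc...]), you mixed unicode and str objects, so Python tried to promote everything to unicode. It does that by decoding each str with the default ASCII codec, which fails on the first non-ASCII byte; here that is 0xc2, the first byte of the UTF-8 encoded £ in the description. You knew that some of those str strings were really utf-8 encoded strings, but Python didn't know that. Python 3 would know that and would bark loudly: it keeps bytes and str as separate types and refuses to join them.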
The general rule for unicode is to do conversions at the edges: decode when reading stuff in, encode when writing stuff out. If you had skipped the encode('utf-8') calls and just stuck with unicode in the snippet you gave, it would have worked.
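As a rough illustration of converting at the edges, the sketch below decodes some sample bytes once on the way in and lets io.open do the encoding on the way out. The sample bytes, the temporary file name and the variable names are assumptions made for the example; the real scraper would get its input from the page it downloads.

```python
import io
import os
import tempfile

# Stand-in for bytes that arrived from the network (UTF-8 encoded text).
raw_bytes = b'demands \xc2\xa3500,000 from Albert'

# Edge 1: decode once, as soon as the data comes in.
text = raw_bytes.decode('utf-8')

# Inside the program, work only with unicode strings.
line = u'<desc lang="eng">' + text + u'</desc>'

# Edge 2: encode once, when writing out. io.open handles the encoding.
path = os.path.join(tempfile.mkdtemp(), 'epg_sample.xml')
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(line)

# Read the file back as raw bytes to see what was actually stored.
with io.open(path, 'rb') as handle:
    written = handle.read()
```

io.open is available on Python 2.6 and later and behaves like the built-in open of Python 3, so no manual encode call is needed anywhere in the middle.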
Two other things to consider. Python can escape the strings for you: cgi.escape is good for older HTML, and xml.sax.saxutils.escape is good for XML, XHTML and HTML5. And str.format can help make string formatting more readable.
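A small sketch of the two helpers working together, using made-up sample values rather than real scraper output. One detail worth knowing: escape only handles &, < and > by default, so for a value that ends up inside a double-quoted attribute the quote character is passed in as an extra entity.

```python
from xml.sax.saxutils import escape

# Hypothetical sample values; the real ones would come from the scraper.
title = u'Tom & Jerry <live>'
channel = u'BBC "One" & Two'

# Element text: the default escaping of &, < and > is enough.
safe_title = escape(title)

# Attribute value: also escape the double quote that delimits it.
safe_channel = escape(channel, {u'"': u'&quot;'})

snippet = u'<programme channel="{channel}"><title>{title}</title></programme>'.format(
    channel=safe_channel, title=safe_title)
print(snippet)
```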
Putting it all together...
from xml.sax.saxutils import escape  # escapes &, < and > for XML

# now.year is an int, so convert it before joining; e[...] stay as unicode.
starttime = datetime.strptime(' '.join([str(now.year), e[4], e[0]]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')
endtime = datetime.strptime(' '.join([str(now.year), e[4], e[1]]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')
global epg_data
# Build the record as one unicode string; encode only when writing the file.
epg_data = u"""\
<programme start="{starttime} +0100" stop="{endtime} +0100" channel="{channel}">
<title lang="eng">{e5}</title>
<desc lang="eng">{e2} {e3}</desc>
<icon src="{div_list2}" />
<country>UK</country>
</programme>""".format(channel=escape(channel), starttime=starttime,
endtime=endtime, e5=escape(e[5]), e2=escape(e[2]), e3=escape(e[3]),
div_list2=escape(div_list2))
Upvotes: 2