gdogg371

Reputation: 4152

Trying to convert strings into unicode to load UTF-8 XML file

I am building an EPG scraper that creates a UTF-8 encoded XML file. All is well, except that I am having trouble encoding all the bits of strings I am stitching together into a single unicode string that I can write to my file.

My code is as follows:

starttime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[0].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')
endtime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[1].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')

global epg_data

clean_channel = str(channel.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e2 = str(e[2].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e3 = str(e[3].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
div_list3 = div_list2.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;')
e5 = str(e[5].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))

epg_data = ''.join([u'<programme start="',starttime,u' +0100" stop="',endtime,u' +0100" channel="',clean_channel,u'">\n', \
u'<title lang="eng">',e5,u'</title>\n<desc lang="eng">',clean_e2,' ',clean_e3,u'</desc>\n<icon src="',div_list3,u'" />\n', \
u'<country>UK</country>\n</programme>'])

I am hitting a problem when trying to parse the following (as printed to IDLE):

<programme start="20180514180500 +0100" stop="20180514190000 +0100" channel="BBC Entertainment">
<title lang="eng">Hustle</title>
<desc lang="eng">Hustle Tiger Troubles Season 6 Episode 3/6When a notorious hardman demands £500,000 from Albert by the end of the week, the team tries to raise the cash by targeting a playboy in possession of a gold tiger worth a vast amount of money. Emma is sent to persuade the owner to lend the item to a major museum, in the hope the gang can steal it, but an impenetrable vault causes complications. Guest starring former Doctor Who star Colin Baker and Lolita Chakrabarti : 8.2</desc>
<icon src="http://my.tvguide.co.uk/channel_logos/60x35/68.png" />
<country>UK</country>
</programme>

The generated error is as follows:

Traceback (most recent call last):
  File "G:\Python27\Kodi\Sky TV Guide Scraper.py", line 332, in soup_to_text
    u'<country>UK</country>\n</programme>'])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 75: ordinal not in range(128)

I've sort of lost my way a bit with sorting this out, so any help would be gratefully received.

Thanks

Upvotes: 0

Views: 230

Answers (1)

tdelaney

Reputation: 77407

Unicode support is rather confusing in Python 2. That's in the top 50 reasons to move to Python 3. Encoding a str or unicode to UTF-8 returns a str object, which is indistinguishable from a regular ASCII string. You just have to remember that it's encoded. str(channel.encode('utf-8')) is a bit redundant (the result is already a str, so the str(..) part isn't necessary).

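When you called ''.join([u'<programme start="', etc...]), you mixed unicode and str objects, so Python tried to promote everything to unicode. It does that by decoding each str with the default ASCII codec, which fails on the first non-ASCII byte: 0xc2 is the first byte of the UTF-8 encoded "£" in your description. You knew that some of those str strings were really UTF-8 encoded strings, but Python didn't know that. Python 3 would know that and would bark loudly.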
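To make the failure concrete, here is a minimal sketch of what happens to the "£" from the description. The variable names are made up for illustration. It is written to run on both Python 2 and Python 3: Python 2 raises UnicodeDecodeError on the mixed join, while Python 3 refuses to mix the types at all and raises TypeError.

```python
# The pound sign from the programme description, as unicode text.
pound_text = u'\xa3500,000'
# Encoding to UTF-8 produces bytes that begin with 0xc2 0xa3.
pound_bytes = pound_text.encode('utf-8')

# This is the step Python 2 performs implicitly when a str is joined
# with unicode objects: decode the bytes with the ASCII codec.
try:
    pound_bytes.decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Mixing the two types directly. Python 2 raises UnicodeDecodeError,
# Python 3 raises TypeError.
try:
    u''.join([u'<desc>', pound_bytes, u'</desc>'])
    mix_error = None
except (UnicodeDecodeError, TypeError) as exc:
    mix_error = exc

# Keeping everything as unicode works on both versions.
joined = u''.join([u'<desc>', pound_text, u'</desc>'])
```

Printing decode_error should show the same "'ascii' codec can't decode byte 0xc2" message as the traceback in the question, just at a different position.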

The general rule for unicode is to do conversions at the edges. Decode when reading stuff in, encode when writing stuff out. If you had skipped the encode('utf-8') stuff and just stuck with unicode in the snippet you gave, it would have worked.
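As a rough sketch of "conversions at the edges": decode bytes once when they come in, work with unicode in the middle, and let the file object encode on the way out. The sample bytes, the temporary path and the variable names here are assumptions for illustration only. io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Input edge: raw bytes, as they might arrive from a scraped page.
raw = b'demands \xc2\xa3500,000 from Albert'
text = raw.decode('utf-8')  # decode once, on the way in

# Middle: everything is unicode, so joining and formatting are safe.
record = u'<desc lang="eng">{0}</desc>\n'.format(text)

# Output edge: the file object encodes to UTF-8 on the way out.
path = os.path.join(tempfile.mkdtemp(), 'epg.xml')
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(record)

# Read the file back as bytes to confirm what landed on disk.
with io.open(path, 'rb') as handle:
    written = handle.read()
```

If the scraping library already hands back unicode objects, the decode step is not needed and only the output edge matters.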

Two other things to consider: Python can escape the strings for you. cgi.escape is good for older HTML. xml.sax.saxutils.escape is good for XML, XHTML and HTML5. And str.format can help make string formatting more readable.
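A short sketch of the xml.sax.saxutils helpers, using made-up sample strings. By default escape only handles &, < and >. For attribute values you can pass an extra entities mapping, or use quoteattr, which returns the value already wrapped in quotes.

```python
from xml.sax.saxutils import escape, quoteattr

# Element text: &, < and > are replaced by default.
title = escape(u'Tom & Jerry <Classic>')

# Attribute values: pass an extra mapping so double quotes are escaped too.
size = escape(u'5" disc', {u'"': u'&quot;'})

# quoteattr escapes the value and adds the surrounding quotes itself.
channel_attr = quoteattr(u'Film4 & More')
tag = u'<programme channel={0}>'.format(channel_attr)
```

This replaces the chain of .replace() calls in the question, and it gets the ordering right (ampersands are escaped first).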

Putting it all together...

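from xml.sax.saxutils import escape

# now.year is an int, so convert it to unicode before joining
starttime = datetime.strptime(' '.join([unicode(now.year), e[4], e[0]]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')
endtime = datetime.strptime(' '.join([unicode(now.year), e[4], e[1]]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')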

global epg_data

epg_data = u"""\
<programme start="{starttime} +0100" stop="{endtime} +0100" channel="{channel}">
    <title lang="eng">{e5}</title>
    <desc lang="eng">{e2} {e3}</desc>
    <icon src="{div_list2}" />
    <country>UK</country>
</programme>""".format(channel=escape(channel), starttime=starttime,
    endtime=endtime, e5=escape(e[5]), e2=escape(e[2]), e3=escape(e[3]),
    div_list2=escape(div_list2))
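For reference, here is a self-contained sketch of the same idea that can be run on its own. The function names, the template field names and the sample list are assumptions for illustration. The layout of e is guessed from the question (start, stop, description, rating, date, title). It also assumes an English locale, since %a, %b and %p are locale dependent.

```python
from datetime import datetime
from xml.sax.saxutils import escape

TIME_IN = '%Y %a %d %b %I:%M%p'   # e.g. "2018 Mon 14 May 6:05pm"
TIME_OUT = '%Y%m%d%H%M%S'         # e.g. "20180514180500"

TEMPLATE = u"""\
<programme start="{start} +0100" stop="{stop} +0100" channel="{channel}">
    <title lang="eng">{title}</title>
    <desc lang="eng">{desc} {rating}</desc>
    <icon src="{icon}" />
    <country>UK</country>
</programme>"""

# Extra mapping so double quotes are escaped inside attribute values.
QUOTE = {u'"': u'&quot;'}


def xmltv_time(year, day, clock):
    """Combine year, 'Mon 14 May' and '6:05pm' into a timestamp string."""
    stamp = u'{0} {1} {2}'.format(year, day, clock)
    return datetime.strptime(stamp, TIME_IN).strftime(TIME_OUT)


def build_programme(year, channel, e, icon_url):
    """Return one <programme> element as unicode text, with values escaped."""
    return TEMPLATE.format(
        start=xmltv_time(year, e[4], e[0]),
        stop=xmltv_time(year, e[4], e[1]),
        channel=escape(channel, QUOTE),
        title=escape(e[5]),
        desc=escape(e[2]),
        rating=escape(e[3]),
        icon=escape(icon_url, QUOTE),
    )


# Hypothetical sample row, shaped like the scraped data in the question.
sample = [u'6:05pm', u'7:00pm', u'A hardman demands \xa3500,000 from Albert',
          u': 8.2', u'Mon 14 May', u'Hustle']
programme = build_programme(
    2018, u'BBC Entertainment', sample,
    u'http://my.tvguide.co.uk/channel_logos/60x35/68.png')
```

The result stays unicode until it is written out, so it should only be encoded at the point where it goes into the UTF-8 file.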

Upvotes: 2
