Trying to convert strings into unicode to load UTF-8 XML file

Question

I am building an EPG scraper that creates a UTF-8 encoded XML file. All is well, except that I am having trouble encoding all the bits of strings I am stitching together into a unicode string that I can load into my file.

My code is as follows:

starttime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[0].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')
endtime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[1].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')

global epg_data

clean_channel = str(channel.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e2 = str(e[2].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e3 = str(e[3].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
div_list3 = div_list2.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;')
e5 = str(e[5].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))

epg_data = ''.join([u'
', \
u'',e5,u'
',clean_e2,' ',clean_e3,u'

', \
u'UK
'])

I am hitting a problem when trying to parse the following (as printed to IDLE):


Hustle
Hustle Tiger Troubles Season 6 Episode 3/6When a notorious hardman demands Â£500,000 from Albert by the end of the week, the team tries to raise the cash by targeting a playboy in possession of a gold tiger worth a vast amount of money. Emma is sent to persuade the owner to lend the item to a major museum, in the hope the gang can steal it, but an impenetrable vault causes complications. Guest starring former Doctor Who star Colin Baker and Lolita Chakrabarti : 8.2

UK

The generated error is as follows:

Traceback (most recent call last):
  File "G:\Python27\Kodi\Sky TV Guide Scraper.py", line 332, in soup_to_text
    u'UK
'])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 75: ordinal not in range(128)

I've sort of lost my way a bit with sorting this out, so any help would be gratefully received.

Thanks

tdelaney · Accepted Answer

Unicode support is rather confusing in Python 2. That's in the top 50 reasons to move to Python 3. Encoding a str or unicode to UTF-8 returns a str object, which is indistinguishable from a regular ASCII string. You just have to remember that it's encoded. str(channel.encode('utf-8')) is a bit redundant: the result is already a str, so the str(..) part isn't necessary.
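A minimal sketch of why the encoded value is easy to trip over. It spells out explicitly the ASCII decode that Python 2 attempts on its own when a str meets a unicode object. The sample text is made up for illustration, and the snippet is written to run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

text = u'\xa3500,000'            # unicode text starting with a pound sign
encoded = text.encode('utf-8')   # byte string: str on Python 2, bytes on Python 3

# The pound sign is stored as the two bytes 0xc2 0xa3.
first_two = list(bytearray(encoded)[:2])
print([hex(b) for b in first_two])

# Python 2 performs this decode implicitly when a str meets a unicode object.
try:
    encoded.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print(exc)

# Decoding with the codec that produced the bytes recovers the text.
roundtrip = encoded.decode('utf-8')
print(roundtrip == text)
```

The byte 0xc2 here is the same one named in the traceback from the question.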

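When you called ''.join([...]) on a list that mixes unicode literals (u'...') with UTF-8 encoded str objects, Python 2 had to promote every str to unicode first, and it does that with the default ascii codec. The encoded £ starts with byte 0xc2, which is not ASCII, so the join raises the UnicodeDecodeError in your traceback. It is simpler to leave the values as unicode, escape them for XML, build the whole record from a single unicode template, and encode to UTF-8 only when writing the file. With an escape function (for example xml.sax.saxutils.escape):

u""" {e5} {e2} {e3} UK """.format(channel=escape(channel), starttime=starttime, endtime=endtime, e5=escape(e5), e2=escape(e2), e3=escape(e3), div_list2=escape(div_list2))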
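A self-contained sketch of that approach, under stated assumptions: the element names in TEMPLATE, the helper names xml_text and build_programme, and all sample values are placeholders, since the real markup is not visible in the post. Only xml.sax.saxutils.escape is a real library call.

```python
# -*- coding: utf-8 -*-
from xml.sax.saxutils import escape

# Placeholder markup (assumption): the real element names are not shown in the post.
TEMPLATE = (
    u'<programme start="{starttime}" stop="{endtime}" channel="{channel}">\n'
    u'  <title>{title}</title>\n'
    u'  <desc>{desc}</desc>\n'
    u'</programme>\n'
)

# escape() handles &, < and > by default; the extra mapping covers quotes
# so the same helper is safe inside attribute values.
QUOTES = {u'"': u'&quot;', u"'": u'&apos;'}


def xml_text(value):
    """Escape a unicode value for use in XML text or attribute values."""
    return escape(value, QUOTES)


def build_programme(channel, starttime, endtime, title, desc):
    """Return one record as unicode; nothing is encoded here."""
    return TEMPLATE.format(
        channel=xml_text(channel),
        starttime=starttime,
        endtime=endtime,
        title=xml_text(title),
        desc=xml_text(desc),
    )


# Hypothetical sample values, kept as unicode throughout.
record = build_programme(
    channel=u'Example & Co',
    starttime=u'20150101210000',
    endtime=u'20150101220000',
    title=u'Hustle',
    desc=u'A hardman demands \xa3500,000 from Albert',
)

# Encode exactly once, at the point where bytes are needed.
xml_bytes = record.encode('utf-8')
```

When writing to disk, io.open(path, 'w', encoding='utf-8') accepts the unicode record directly on both Python 2 and Python 3, so the explicit encode step can be left to the file object.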
