Nick

Reputation: 141

Struggling with unicode in Python

I'm trying to automate the extraction of data from a large number of files, and it works for the most part. It just falls over when it encounters non-ASCII characters:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)

How do I set my 'brand' to UTF-8? My code is being repurposed from something else (which was using lxml), and that didn't have any issues. I've seen lots of discussions about encode / decode, but I don't understand how I'm supposed to implement it. The below is cut down to just the relevant code - I've removed the rest.

i = 0

filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

for i in range (len(filenames)):
    pathname = filenames[i]

    fin = open(pathname, 'r')
    with codecs.open(('Assets'+'.log'), mode='w', encoding='utf-8') as f:
        f.write(u'File Path|Brand\n')
        lines = fin.read()
        brand_start = lines.find("Brand Title")
        brand_end = lines.find("/>",brand_start)
        brand = lines [brand_start+47:brand_end-2]
        f.write(u'{}|{}\n'.format(pathname[4:35],brand))

flog.close()

I'm sure there is a better way to write the whole thing, but at the moment my focus is just on trying to understand how to get the lines / read functions to work with UTF-8.

Upvotes: 1

Views: 98

Answers (2)

Nick

Reputation: 141

This is my final code, using the guidance from above. It's not pretty, but it solves the problem. I'll look at getting it all working using lxml at a later date (as this is something I've encountered before when working with different, larger xml files):

import io
import os
from glob import glob

from lxml import etree

nsmap = {'xmlns': 'thisnamespace'}

filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

# Open the log once; the with block closes it, so no explicit close() is needed
with io.open('Assets.log', 'w', encoding='utf-8') as f:
    f.write(u'File Path|Series|Brand\n')

    for pathname in filenames:
        # lxml reads the raw bytes and decodes them itself
        root = etree.parse(pathname).getroot()

        # Read the decoded text once per file for the manual string search;
        # calling read() inside the loop below would return u'' after the first pass
        with io.open(pathname, encoding='utf-8') as fin:
            lines = fin.read()

        for info in root.xpath('//somepath'):
            series_x = info.find('./somemorepath')
            series = series_x.get('Asset_Name') if series_x is not None else u'Missing'
            brand_start = lines.find(u"sometext")
            brand_end = lines.find(u"/>", brand_start)
            brand = lines[brand_start:brand_end - 2]
            brand = brand[brand.rfind(u"/") + 1:]
            f.write(u'{}|{}|{}\n'.format(pathname[5:42], series, brand))

Someone will now come along and do it all in one line!

Upvotes: 0

Martijn Pieters

Reputation: 1121644

You are mixing bytestrings with Unicode values; your fin file object produces bytestrings, and you are mixing it with Unicode here:

f.write(u'{}|{}\n'.format(pathname[4:35],brand))

brand is a bytestring, interpolated into a Unicode format string. Either decode brand there, or better yet, use io.open() (rather than codecs.open(), which is not as robust as the newer io module) to manage both your files:

with io.open('Assets.log', 'w', encoding='utf-8') as f,\
        io.open(pathname, encoding='utf-8') as fin:
    f.write(u'File Path|Brand\n')
    lines = fin.read()
    brand_start = lines.find(u"Brand Title")
    brand_end = lines.find(u"/>", brand_start)
    brand = lines[brand_start + 47:brand_end - 2]
    f.write(u'{}|{}\n'.format(pathname[4:35], brand))
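
If you would rather keep reading bytes and decode brand explicitly, the idea looks roughly like the sketch below. The sample bytes and the path are made up for illustration, and the explicit ASCII decode only mimics what Python 2 does implicitly when a bytestring meets a Unicode value:

```python
# Hypothetical sample: bytes as they would come from a file opened in binary mode.
# 0xc5 is the first byte of a two-byte UTF-8 sequence, as in the traceback.
raw = b'Brand Title="\xc5\x81\xc3\xb3d\xc5\xba Films"'

# Decoding as ASCII (what Python 2 attempts implicitly) fails on byte 0xc5.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decoding with the file's actual encoding yields a Unicode value ...
brand = raw.decode('utf-8').split(u'"')[1]

# ... which can be interpolated safely into a Unicode format string.
line = u'{}|{}\n'.format(u'some/path.xml', brand)
```

Decoding once, as early as possible, and working only with Unicode values afterwards is usually easier to reason about than decoding at each point of use.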

You also appear to be parsing an XML file by hand; you may want to use the ElementTree API to extract those values instead. In that case, open the file without io.open() so that it produces bytestrings (or pass the filename straight to the parser); the XML parser can then correctly decode the data to Unicode values for you.
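
As a rough sketch of the ElementTree route: the element name Brand, the attribute name Title and the helper read_brand are assumptions for illustration, since the real document structure isn't shown in the question.

```python
import xml.etree.ElementTree as ET


def read_brand(xml_bytes):
    """Return the Title attribute of the first Brand element (hypothetical names)."""
    # The parser receives bytes and decodes them according to the XML declaration.
    root = ET.fromstring(xml_bytes)
    node = root.find('.//Brand')
    return node.get('Title') if node is not None else u'Missing'


# Hypothetical UTF-8 encoded document standing in for one of the files.
sample = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<Asset><Brand Title="\u0141\xf3d\u017a Films"/></Asset>'
).encode('utf-8')

brand = read_brand(sample)
missing = read_brand(b'<Asset/>')
```

For files on disk, ET.parse(pathname).getroot() would take the place of ET.fromstring(); either way the attribute value can be written directly to a log opened with io.open(..., encoding='utf-8').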

Upvotes: 1
