Reputation: 141
I'm trying to automate the extraction of data from a large number of files, and it works for the most part. It just falls over when it encounters non-ASCII characters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)
How do I set my 'brand' to UTF-8? My code is being repurposed from something else (which was using lxml), and that didn't have any issues. I've seen lots of discussions about encode / decode, but I don't understand how I'm supposed to implement it. The below is cut down to just the relevant code - I've removed the rest.
i = 0
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
for i in range (len(filenames)):
pathname = filenames[i]
fin = open(pathname, 'r')
with codecs.open(('Assets'+'.log'), mode='w', encoding='utf-8') as f:
f.write(u'File Path|Brand\n')
lines = fin.read()
brand_start = lines.find("Brand Title")
brand_end = lines.find("/>",brand_start)
brand = lines [brand_start+47:brand_end-2]
f.write(u'{}|{}\n'.format(pathname[4:35],brand))
flog.close()
I'm sure there is a better way to write the whole thing, but at the moment my focus is just on trying to understand how to get the lines / read functions to work with UTF-8.
Upvotes: 1
Views: 98
Reputation: 141
This is my final code, using the guidance from above. It's not pretty, but it solves the problem. I'll look at getting it all working using lxml at a later date (as this is something I've encountered before when working with different, larger xml files):
import lxml
import io
import os
from lxml import etree
from glob import glob
nsmap = {'xmlns': 'thisnamespace'}
i = 0
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
with io.open(('Assets.log'),'w',encoding='utf-8') as f:
f.write(u'File Path|Series|Brand\n')
for i in range (len(filenames)):
pathname = filenames[i]
parser = lxml.etree.XMLParser()
tree = lxml.etree.parse(pathname, parser)
root = tree.getroot()
fin = open(pathname, 'r')
with io.open(pathname, encoding='utf-8') as fin:
for info in root.xpath('//somepath'):
series_x = info.find ('./somemorepath')
series = series_x.get('Asset_Name') if series_x != None else 'Missing'
lines = fin.read()
brand_start = lines.find(u"sometext")
brand_end = lines.find(u"/>",brand_start)
brand = lines [brand_start:brand_end-2]
brand = brand[(brand.rfind("/"))+1:]
f.write(u'{}|{}|{}\n'.format(pathname[5:42],series,brand))
f.close()
Someone will now come along and do it all in one line!
Upvotes: 0
Reputation: 1121644
You are mixing bytestrings with Unicode values; your fin
file object produces bytestrings, and you are mixing it with Unicode here:
f.write(u'{}|{}\n'.format(pathname[4:35],brand))
brand
is a bytestring, interpolated into a Unicode format string. Either decode brand
there, or better yet, use io.open()
(rather than codecs.open()
, which is not as robust as the newer io
module) to manage both your files:
with io.open('Assets.log', 'w', encoding='utf-8') as f,\
io.open(pathname, encoding='utf-8') as fin:
f.write(u'File Path|Brand\n')
lines = fin.read()
brand_start = lines.find(u"Brand Title")
brand_end = lines.find(u"/>", brand_start)
brand = lines[brand_start + 47:brand_end - 2]
f.write(u'{}|{}\n'.format(pathname[4:35], brand))
You also appear to be parsing out an XML file by hand; perhaps you want to use the ElementTree API instead to parse out those values. In that case, you'd open the file without io.open()
, so producing byte strings, so that the XML parser can correctly decode the information to Unicode values for you.
Upvotes: 1