Reputation: 421
I don't understand this error code. Could anyone help me?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)
This is the code:
import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
    buff = []
    for line in data:
        if separator(line):
            if buff:
                yield ''.join(buff)
                buff[:] = []
        buff.append(line)
    yield ''.join(buff)

def first(seq,default=None):
    """Return the first item from sequence, seq or the default(None) value"""
    for item in seq:
        return item
    return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

if not os.path.exists(filename):
    with open(filename,'wb') as file_write:
        r = urllib2.urlopen(datasrc)
        file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None
count = 0
for item in xmlSplitter(zf.open(xml_file)):
    count += 1
    if count > 10: break
    doc = etree.XML(item)
    docID = first(doc.xpath('//publication-reference/document-id/doc-number/text()'))
    title = first(doc.xpath('//invention-title/text()'))
    lastName = first(doc.xpath('//addressbook/last-name/text()'))
    firstName = first(doc.xpath('//addressbook/first-name/text()'))
    street = first(doc.xpath('//addressbook/address/street/text()'))
    city = first(doc.xpath('//addressbook/address/city/text()'))
    state = first(doc.xpath('//addressbook/address/state/text()'))
    postcode = first(doc.xpath('//addressbook/address/postcode/text()'))
    country = first(doc.xpath('//addressbook/address/country/text()'))
    print "DocID: {0}\nTitle: {1}\nLast Name: {2}\nFirst Name: {3}\nStreet: {4}\ncity: {5}\nstate: {6}\npostcode: {7}\ncountry: {8}\n".format(docID,title,lastName,firstName,street,city,state,postcode,country)
I got the code somewhere on the internet and changed only a little of it: I added the street, city, state, postcode, and country fields.
The XML file contains approximately 2 million lines. Do you think that is the reason?
Upvotes: 0
Views: 2164
Reputation: 1121584
You are parsing XML, and the library already knows how to handle decoding for you. The API returns unicode objects, but you are trying to treat them as byte strings instead.
Where you call ''.format(), you are using a Python bytestring instead of a unicode object, so Python has to encode the Unicode values to fit in a bytestring. To do so it can only use a default, which is ASCII.
The simple solution is to use a unicode string there instead; note the u'' string literal:
print u"DocID: {0}\nTitle: {1}\nLast Name: {2}\nFirst Name: {3}\nStreet: {4}\ncity: {5}\nstate: {6}\npostcode: {7}\ncountry: {8}\n".format(docID,title,lastName,firstName,street,city,state,postcode,country)
Python will still have to encode this when printing, but at least now Python can do some auto-detection of your terminal and determine what encoding it needs to use.
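To see the mechanism in isolation, here is a minimal sketch. The surname is a made-up sample value, not something taken from the patent data, and the explicit .encode('ascii') call stands in for the implicit encoding Python 2 performs when a byte-string template meets a unicode argument. The snippet is written so that it should run unchanged on Python 2.7 and Python 3.3+.

```python
# Hypothetical value standing in for what doc.xpath(...) returns.
last_name = u'Kr\xe4mer'

# Python 2 does this implicitly when a byte-string template is formatted
# with a unicode argument: it encodes the argument with the ASCII codec.
try:
    last_name.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc  # the a-umlaut at index 2 has no ASCII representation

# With a unicode template nothing has to be encoded during formatting.
line = u"Last Name: {0}".format(last_name)

# Encoding then happens once, at the output boundary, with a codec you choose.
encoded = line.encode('utf-8')
```

If the terminal encoding cannot be detected (for example when output is piped to a file under Python 2), encoding explicitly like the last line is one way to stay in control of what gets written.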
You may want to read up on Python and Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Upvotes: 3
Reputation: 143785
There's no such thing as plain text. Text always has an encoding, which is the way a given symbol (a letter, a comma, a Japanese kanji) is represented as a series of bytes. The mapping from a symbol's code to those bytes is called the encoding.
In Python 2.7 the distinction between encoded text (str) and generic, unencoded text (unicode) is confusing at best. Python 3 ditched the whole thing: str is always Unicode, and encoded data lives in a separate bytes type.
In any case, what is happening here is that you are trying to put some text into a byte string, but the text contains a character that cannot be coerced to ASCII. ASCII only understands characters in the range 0-127, which is the standard set of characters (letters, digits, and the symbols you use for programming). One possible extension of ASCII is Latin-1 (also known as ISO-8859-1), where the range 128-255 maps to Latin characters such as accented a. This encoding has the advantage that you still get one byte == one character. UTF-8 is another extension of ASCII, which drops the one byte == one character constraint: some characters are represented with one byte, some with two, and so on.
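The difference between the three encodings can be checked directly. This is a small sketch with sample characters chosen for illustration; encoded_length is a hypothetical helper name, not a library function.

```python
# Sample characters: plain ASCII, a Latin-1 character, and a kanji.
samples = {
    'letter a': u'a',
    'a-umlaut': u'\xe4',
    'kanji': u'\u6f22',
}

def encoded_length(text, codec):
    """Return how many bytes text needs in codec, or None if it cannot be encoded."""
    try:
        return len(text.encode(codec))
    except UnicodeEncodeError:
        return None

for label in sorted(samples):
    char = samples[label]
    sizes = [encoded_length(char, codec) for codec in ('ascii', 'latin-1', 'utf-8')]
    print('{0}: ascii={1} latin-1={2} utf-8={3}'.format(label, *sizes))
```

The a-umlaut does not fit in ASCII at all, takes one byte in Latin-1 and two in UTF-8; the kanji fits only in UTF-8, where it takes three bytes.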
How to solve your problem depends on where the problem comes in. I guess you are parsing a text file in an encoding you don't know, which could be either Latin-1 or UTF-8. If so, you have to open the file with an explicit encoding, for example encoding='utf-8' (in Python 2 that means io.open() or codecs.open(), since the built-in open() takes no encoding argument). It's hard to say more from what you provide.
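As a sketch of what specifying the encoding looks like, the snippet below uses io.open(), which accepts an encoding argument on both Python 2.7 and Python 3. The temporary file and its contents are made up for the demonstration; in the question the data comes from a zip archive instead, so treat this as an illustration of the idea rather than a drop-in fix.

```python
import io
import os
import tempfile

# Create a small UTF-8 file to read back (a stand-in for real input).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u'Kr\xe4mer\n')

# Reading with the right codec gives back the original text.
with io.open(path, encoding='utf-8') as handle:
    as_utf8 = handle.read()

# Reading with the wrong codec does not fail here, but silently
# turns the two UTF-8 bytes of the a-umlaut into two characters.
with io.open(path, encoding='latin-1') as handle:
    as_latin1 = handle.read()

os.remove(path)
```

Guessing Latin-1 for UTF-8 data never raises, because every byte is a valid Latin-1 character, so garbled output is the usual symptom of a wrong guess in that direction.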
Upvotes: 3
Reputation: 5846
The ASCII characters range from 0 (\x00) to 127 (\x7F). Your character (\xE4=228) is bigger than the highest possible value. Therefore you have to change the codec (for example to UTF-8) to be able to encode this value.
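A quick way to find which characters are out of range is to compare their code points against 127. This is only a sketch; non_ascii_chars is a hypothetical helper and the input is a made-up sample.

```python
def non_ascii_chars(text):
    """Return (position, character, code point) for every character above 127."""
    return [(i, ch, ord(ch)) for i, ch in enumerate(text) if ord(ch) > 127]

# Hypothetical sample value; the a-umlaut sits at index 2 with code point 228.
found = non_ascii_chars(u'Kr\xe4mer')

# The same character encodes without trouble once a wider codec is used.
utf8_bytes = u'\xe4'.encode('utf-8')
```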
Upvotes: 1