Vincent

Reputation: 1157

GAE Python: Importing UTF-8 Characters from an XML file to a database model

I am parsing an XML file from an online source but am having trouble reading UTF-8 characters. I have read through other questions that deal with similar problems, but none of the solutions has worked so far. The code currently looks like this:

class XMLParser(webapp2.RequestHandler):
    def get(self):
        url = fetch('some.xml.online')
        xml = parseString(url.content)
        vouchers = xml.getElementsByTagName("VoucherCode")
        for voucher in vouchers:
            if voucher.getElementsByTagName("ActivePartnership")[0].firstChild.data == "true":
                coupon = Coupon()
                coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data.decode('utf-8'))
                coupon.prov_key = str(voucher.getElementsByTagName("Id")[0].firstChild.data)
                coupon.put()
                self.redirect('/admin/coupon')

The error that I get from this is displayed below. It is caused by a "ü" in the description field, which I will also need to display later on when using the data.

File "C:\Users\Vincent\Documents\www\Sparkompass\Website\main.py", line 217, in get
  coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data.decode('utf-8'))
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
  return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)

If I take out the description everything works as it should. In the database model definition I have defined the description as follows:

description = db.StringProperty(multiline=True)

Attempt 2

I have also tried to do it like this:

coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data).decode('utf-8')

Which also gave me:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)

Any help would be very much appreciated!

UPDATE

The XML file contains German text, so many more of its characters are non-ASCII. I am therefore now thinking that it might be better to do the decoding at a higher level, e.g. at

xml = parseString(url.content)

However, so far I haven't got that to work either. The aim is to get the characters into ASCII because this is what GAE requires to register the value as a string in the database model.

Upvotes: 2

Views: 479

Answers (2)

Vincent

Reputation: 1157

I solved the problem for now by changing the description to a TextProperty, which didn't give any error. I am aware that this means I won't be able to sort or filter on it, but for the description that should be OK.

Background info: https://developers.google.com/appengine/docs/python/datastore/typesandpropertyclasses#TextProperty
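For what it's worth, as far as I understand it the original error did not come from the XML itself: minidom already returns the node text as unicode, and in Python 2 both str() and .decode('utf-8') on a unicode value first try an implicit ASCII encode, which fails on "ü". Below is a minimal sketch of reading the text without that step. The XML sample is made up and stands in for url.content, and the sketch does not touch the datastore, so it says nothing about which property types accept the value. It is written so that it runs on both Python 2.7 and Python 3.

```python
from xml.dom.minidom import parseString

# Made-up stand-in for url.content: UTF-8 encoded bytes, as a fetch would return.
xml_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<Vouchers><VoucherCode><Id>42</Id>'
    u'<Description>Gutschein f\u00fcr B\u00fccher</Description>'
    u'</VoucherCode></Vouchers>'
).encode("utf-8")

doc = parseString(xml_bytes)
voucher = doc.getElementsByTagName("VoucherCode")[0]

# The parser has already decoded the bytes; .data is text, not bytes,
# so no str() or .decode() call is applied here.
description = voucher.getElementsByTagName("Description")[0].firstChild.data

# Encode explicitly, and only where a byte string is really needed.
description_bytes = description.encode("utf-8")

print(repr(description))
print(repr(description_bytes))
```

The point of the sketch is only the direction of the conversion: bytes go in, text comes out of the parser, and any encode back to bytes is an explicit choice of encoding rather than an implicit ASCII one.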

Upvotes: 0

Kiwisauce

Reputation: 1354

>>> u"ü".decode("utf-8")
UnicodeEncodeError
>>> u"ü".encode("utf-8")
'\xc3\xbc'
>>> u"ü".encode("utf-8").decode("utf-8")
u'\xfc'
>>> str(u"ü".encode("utf-8").decode("utf-8"))
UnicodeEncodeError
>>> str(u"ü".encode("utf-8"))
'\xc3\xbc'

Which encoding do you need?

You could also use:

string2 = cgi.escape(string).encode("latin-1", "xmlcharrefreplace")

This replaces every character that is not in Latin-1 with an XML character reference (cgi.escape additionally escapes &, < and >).
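To illustrate what the "xmlcharrefreplace" error handler does, here is a small sketch. The sample text is an assumption chosen so that one character ("ü") exists in Latin-1 and one ("€") does not; cgi.escape is left out so that only the encoding step is shown.

```python
# Sample text "für 5 €", written with escapes so the file itself stays ASCII.
text = u"f\u00fcr 5 \u20ac"

# Latin-1 can represent "ü" directly, so only the euro sign is replaced.
latin1 = text.encode("latin-1", "xmlcharrefreplace")

# With an ASCII target, every non-ASCII character becomes a reference.
ascii_only = text.encode("ascii", "xmlcharrefreplace")

print(repr(latin1))      # the bytes f\xfcr 5 &#8364;
print(repr(ascii_only))  # the bytes f&#252;r 5 &#8364;
```

Note that the Latin-1 result still contains the raw byte 0xFC for "ü", so it is not pure ASCII. If ASCII output is what you need, the second form is the one to use.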

Upvotes: 0
