Roland

Reputation: 537

Reading an XML file which contains unicode in python 2.7

I'm trying to use Python 2.7.6 with ElementTree to parse an XML file that is retrieved from a server and contains Unicode text, and to save the contained data locally.

import unicodedata
import xml.etree.ElementTree as ET

def normalize(string):
    if isinstance(string, unicode): 
        normalized_string  = unicodedata.normalize('NFKD', string).encode('ascii','ignore')
    elif isinstance(string, str):
        normalized_string  = string
    else:
        print "no string"
        normalized_string  = string

    normalized_string  = ''.join(e for e in normalized_string if e.isalnum())
    return normalized_string

tree = ET.parse('test.xml')
root = tree.getroot()

for element in root:
    value = element.find('value').text
    filename = normalize(element.find('name').text.encode('utf-8')) + '.txt'
    target = open(filename, 'a')
    target.write(value + '\n')
    target.close()

The file I'm parsing is similar in structure to the following, which I've saved locally as test.xml:

<data> 
<product><name>Something with a space</name><value>10</value> </product>
<product><name>Jakub Šlemr</name><value>12</value></product>
<product><name>Something with: a colon</name><value>11</value></product>
</data>

The code above has multiple problems, which I'd like to solve:

  1. The unicode character Š was not well-digested by this code. Edit: This has been resolved, as it was partly due to wrong file encoding.
  2. I'd like to avoid special characters in the filenames, such as whitespace and colons. What's the best way of preprocessing these? I've built a normalize function based on the answers to "Remove all special characters, punctuation and spaces from string" and "Convert a Unicode string to a string in Python (containing extra symbols)". Is this an OK approach?
  3. Is element.find('value').text the best way to access the values stored in the XML document, assuming that every element has exactly one entry named value?

Upvotes: 1

Views: 2827

Answers (1)

jsalonen

Reputation: 30531

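Values in element.find('value').text are unicode objects whenever the text contains non-ASCII characters (in Python 2, pure-ASCII text may come back as a plain str). When you concatenate a unicode object with a byte string such as '.txt', Python 2 implicitly decodes the byte string as ASCII and the result is a unicode object.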

A unicode object has to be encoded into bytes before it can be printed or stored. If you don't do that explicitly, Python 2 does it implicitly using a default encoding. That default is normally ASCII, which covers only a very limited set of characters, so input containing non-ASCII characters leads to a UnicodeEncodeError.
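The failure can be reproduced without any file I/O. The following is a minimal sketch, not taken from the question's code: it calls encode('ascii') explicitly, which is assumed here to stand in for the conversion Python 2 attempts implicitly, and the sample string is just the name from test.xml.

```python
# A unicode object containing one non-ASCII character (U+0160, S with caron).
text = u'Jakub \u0160lemr'

try:
    # Explicit stand-in for the implicit ASCII conversion described above.
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly with a codec that covers the character succeeds.
utf8_bytes = text.encode('utf-8')

print(ascii_failed)
```

The ASCII codec rejects the character, while the UTF-8 codec turns it into a two-byte sequence.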

I suggest explicitly encoding your unicode objects into byte strings with the encode() method, using a codec that is appropriate for your use case. For example, to encode the text of an element as a UTF-8 byte string, call:

element.find('value').text.encode('utf-8')
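Applied to the loop in the question, explicit encoding might look roughly like the sketch below. It is only an illustration under some assumptions: safe_filename and append_value are hypothetical helper names, io.open with encoding='utf-8' is used instead of calling encode() by hand, and byte-string input is assumed to be pure ASCII.

```python
import io
import os
import tempfile
import unicodedata


def safe_filename(name):
    """Hypothetical helper: reduce a name to ASCII letters and digits."""
    if isinstance(name, bytes):
        # Assumption: byte-string input is pure ASCII.
        name = name.decode('ascii')
    # NFKD splits U+0160 into 'S' plus a combining caron; the caron is dropped
    # by encode('ascii', 'ignore').
    decomposed = unicodedata.normalize('NFKD', name)
    ascii_name = decomposed.encode('ascii', 'ignore').decode('ascii')
    return ''.join(ch for ch in ascii_name if ch.isalnum())


def append_value(directory, name, value):
    """Hypothetical helper: append value as a UTF-8 line to <name>.txt."""
    if isinstance(value, bytes):
        value = value.decode('ascii')
    path = os.path.join(directory, safe_filename(name) + '.txt')
    # io.open encodes on write, so no implicit ASCII conversion takes place.
    with io.open(path, 'a', encoding='utf-8') as target:
        target.write(value + u'\n')
    return path


# Usage with one entry from test.xml, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = append_value(demo_dir, u'Jakub \u0160lemr', u'12')
```

Note that dropping every non-alphanumeric character can map different names to the same filename, so whether this is acceptable depends on the data.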

Also, check that the encoding attribute in your XML declaration matches the encoding the file is actually saved in. A mismatch is a very likely cause of a parse error.
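To illustrate the effect of a mismatched declaration, here is a small sketch using made-up sample bytes rather than the file from the question. The same body is parsed twice: once with a declaration that does not match the bytes, once with one that does.

```python
import xml.etree.ElementTree as ET

# 0xE9 is "e with acute" in ISO-8859-1, but it is not valid UTF-8 here.
body = b'<name>caf\xe9</name>'

wrong = b'<?xml version="1.0" encoding="utf-8"?>' + body
right = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body

try:
    ET.fromstring(wrong)
    mismatch_detected = False
except ET.ParseError:
    # The parser rejects bytes that do not match the declared encoding.
    mismatch_detected = True

# With the matching declaration the text is decoded correctly.
name = ET.fromstring(right).text
```

If the server's declaration cannot be fixed, re-saving the file in the declared encoding is the other way to bring the two into agreement.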

Upvotes: 2
