I'm trying to use Python 2.7.6 with ElementTree to parse an XML file, which comes from a server and contains Unicode text, and save the contained data locally.
import unicodedata
import xml.etree.ElementTree as ET

def normalize(string):
    # Reduce a name to ASCII letters and digits so it can serve as a file name.
    if isinstance(string, unicode):
        # Decompose accented characters, then drop everything that is not ASCII.
        normalized_string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore')
    elif isinstance(string, str):
        normalized_string = string
    else:
        print "no string"
        normalized_string = string
    normalized_string = ''.join(e for e in normalized_string if e.isalnum())
    return normalized_string

tree = ET.parse('test.xml')
root = tree.getroot()
for element in root:
    value = element.find('value').text
    filename = normalize(element.find('name').text.encode('utf-8')) + '.txt'
    # Append the value to a file named after the product.
    target = open(filename, 'a')
    target.write(value + '\n')
    target.close()
The file I'm parsing has a structure similar to the following, which I've saved locally as test.xml:
<data>
<product><name>Something with a space</name><value>10</value> </product>
<product><name>Jakub Šlemr</name><value>12</value></product>
<product><name>Something with: a colon</name><value>11</value></product>
</data>
The code above has multiple problems, which I'd like to solve:
1. Š was not well-digested by this code. Edit: This has been resolved, as it was partly due to wrong file encoding.
2. The normalize function is based on the answers from "Remove all special characters, punctuation and spaces from string" and "Convert a Unicode string to a string in Python (containing extra symbols)". Is this an OK approach?
3. Is element.find('value').text the best way to access the values stored in the XML document, assuming that every element has exactly one entry named value?
Values in element.find('value').text are unicode objects, at least whenever the text contains non-ASCII characters. When you concatenate them with byte strings such as '.txt', Python 2 implicitly decodes the byte string as ASCII and the result is again a unicode object.
You cannot write unicode objects to a file without first encoding them to bytes. If you don't do that explicitly, Python 2 does it implicitly using the default encoding. The default encoding is ASCII, which covers only a very limited set of characters, so any input data containing non-ASCII characters raises a UnicodeEncodeError.
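To make the failure concrete, here is a minimal sketch of what the implicit conversion amounts to. It calls encode('ascii') by hand, which is an assumption standing in for what Python 2 does behind the scenes; try_ascii is a hypothetical helper name, and the sample name is taken from the XML in the question.

```python
# The name from the sample XML, written with an escape so the source stays ASCII.
name = u'Jakub \u0160lemr'


def try_ascii(text):
    """Encode text as ASCII; return None if a character does not fit."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError:
        return None


plain_result = try_ascii(u'10')   # ASCII-only text encodes without trouble
name_result = try_ascii(name)     # the S with caron cannot be represented in ASCII
utf8_result = name.encode('utf-8')  # an explicit codec that covers the character
```

ASCII-only values such as 10 go through unnoticed, which is why the error only shows up once a name like this one appears in the data.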
I would suggest explicitly encoding your unicode objects into byte strings with the encode() method, using a codec that is appropriate for your use case. For example, to encode the text of an element as UTF-8, call:
element.find('value').text.encode('utf-8')
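Put together with the file writing from the question, this could look roughly like the sketch below. The function name append_products, the tab-separated line layout, and the temporary output path are all assumptions made for illustration, and the XML is an inline one-product excerpt of the sample rather than test.xml.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# One product from the sample document, held as UTF-8 bytes, which is
# what the parser would see when reading a UTF-8 file.
XML_BYTES = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<data>'
    u'<product><name>Jakub \u0160lemr</name><value>12</value></product>'
    u'</data>'
).encode('utf-8')


def append_products(xml_bytes, path):
    """Append 'name<TAB>value' lines to path, explicitly encoded as UTF-8."""
    root = ET.fromstring(xml_bytes)
    # Binary mode: the file receives bytes that we encoded ourselves,
    # so no implicit ASCII conversion can happen.
    with open(path, 'ab') as target:
        for element in root:
            name = element.find('name').text
            value = element.find('value').text
            line = u'%s\t%s\n' % (name, value)
            target.write(line.encode('utf-8'))


# Hypothetical output location in a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'products.txt')
append_products(XML_BYTES, out_path)

# Read the file back as UTF-8 to check what was stored.
with io.open(out_path, encoding='utf-8') as check:
    written = check.read()
```

Opening the file in binary mode and encoding each line yourself keeps the choice of codec in one visible place.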
Also, check that the encoding attribute in your XML declaration is set correctly. A declaration that does not match the actual encoding of the file is a very likely cause of a parse error.
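A small sketch of that mismatch, assuming a document that declares UTF-8 but was actually saved as ISO-8859-2 (the choice of ISO-8859-2 and the helper name parse_name are illustrative assumptions, not something taken from your setup):

```python
import xml.etree.ElementTree as ET

# The declaration always claims UTF-8; only the actual byte encoding varies.
TEMPLATE = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<data><product><name>Jakub \u0160lemr</name>'
    u'<value>12</value></product></data>'
)


def parse_name(xml_bytes):
    """Return the first product name, or None if the parser rejects the bytes."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return None
    return root.find('product/name').text


# Bytes really are UTF-8, matching the declaration: parsing succeeds.
matching = parse_name(TEMPLATE.encode('utf-8'))

# Bytes are ISO-8859-2 while the declaration still says UTF-8:
# the parser hits an invalid byte sequence and raises ParseError.
mismatched = parse_name(TEMPLATE.encode('iso-8859-2'))
```

If the file and its declaration disagree, either re-save the file in the declared encoding or correct the declaration.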