Reputation: 655
I am trying to convert data from an XML file to a Python dict, but am unable to do so. This is the code I have written:
import xmltodict
input_xml = 'data.xml'  # This is the source file
with open(input_xml, encoding='utf-8', errors='ignore') as _file:
    data = _file.read()
    data = xmltodict.parse(data, 'ASCII')
    print(data)
    exit()
On executing this code, I get the following error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 239, column 40.
After some trial and error, I realized that my XML contains some Hindi characters inside a particular tag, as shown below:
<DECL>!! आप की सेवा में पुनः पधारे !!</DECL>
How can I ignore these unencoded characters before running xmltodict.parse?
Upvotes: 2
Views: 2806
Reputation: 5077
I would guess the issue is related to the encoding you are telling the parser to use. You read the file as UTF-8, so why pass 'ASCII' as the second argument (the encoding) to xmltodict.parse?
If you parse that same XML from a Python string without the 'ASCII' argument, it should work just fine:
import xmltodict
xml = """<DECL>!! आप की सेवा में पुनः पधारे !!</DECL>"""
xmltodict.parse(xml, process_namespaces=True)
Results in:
OrderedDict([('DECL', '!! आप की सेवा में पुनः पधारे !!')])
Using a file containing just that single line, I am able to parse it properly with:
import xmltodict
input_xml = 'tmp.txt'  # This is the source file
with open(input_xml, encoding='utf-8', mode='r') as _file:
    data = _file.read()
    data = xmltodict.parse(data)
    print(data)
The issue is most probably the 'ASCII' argument: it asks the parser to treat the input as ASCII, and the Hindi characters cannot be represented in ASCII.
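If you want to see the effect of the 'ASCII' argument in isolation, here is a minimal sketch that uses only the standard library's xml.parsers.expat (the module named in your traceback). I am assuming that your xmltodict version hands its encoding argument to expat.ParserCreate as an encoding override; that can differ between versions. The helper name try_parse is made up for this illustration.

```python
from xml.parsers import expat

# The tag from the question, with Hindi text inside it.
HINDI_XML = "<DECL>!! आप की सेवा में पुनः पधारे !!</DECL>"


def try_parse(xml_text, encoding_override=None):
    """Hypothetical helper: return None on success, else the ExpatError text."""
    # encoding_override=None lets expat use its default (UTF-8 when the
    # document has no XML declaration); a value such as 'ASCII' forces it.
    parser = expat.ParserCreate(encoding_override)
    try:
        # Feed the document as UTF-8 bytes, the encoding the file was read with.
        parser.Parse(xml_text.encode("utf-8"), True)
    except expat.ExpatError as exc:
        return str(exc)
    return None


ascii_result = try_parse(HINDI_XML, "ASCII")
default_result = try_parse(HINDI_XML)

print("with 'ASCII' override:", ascii_result)
print("without override:", default_result)
```

With the override the parser should reject the first non-ASCII byte, and without it the same bytes should parse cleanly, which matches dropping 'ASCII' from the xmltodict.parse call.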
Upvotes: 1