Kushagra

Reputation: 655

Python 3: Unable to convert XML to dict using xmltodict

I am trying to convert data from an XML file to a Python dict, but I am unable to do so. This is the code I'm using:

import xmltodict
input_xml  = 'data.xml'  # This is the source file

with open(input_xml, encoding='utf-8', errors='ignore') as _file:
    data = _file.read()
    data = xmltodict.parse(data,'ASCII')
    print(data)
    exit()

Executing this code produces the following error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 239, column 40.
After much trial and error, I realized that my XML contains some Hindi characters inside a particular tag, as shown below:

<DECL>!! आप की सेवा में पुनः पधारे !!</DECL>

How can I ignore these unencoded characters before running xmltodict.parse?

Upvotes: 2

Views: 2806

Answers (1)

João Almeida

Reputation: 5077

I would guess the issue is related to the encoding used when parsing the file. Why are you passing 'ASCII' to xmltodict.parse? ASCII cannot represent the Hindi characters in your data.
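To illustrate why ASCII is a poor fit here, the following minimal sketch uses only the standard library and the sample line from the question. The variable names are made up for the illustration. It shows that the text round-trips through UTF-8 but cannot be encoded as ASCII at all:

```python
# Sample line taken from the question.
text = "<DECL>!! आप की सेवा में पुनः पधारे !!</DECL>"

# UTF-8 can represent every character in the string.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# ASCII covers only code points 0-127, so encoding the Hindi text fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print("UTF-8 round trip matches:", round_trip == text)
print("Encodable as ASCII:", ascii_ok)
```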

If you parse that same XML from a Python string without the 'ASCII' argument, it works just fine:

import xmltodict
xml = """<DECL>!! आप की सेवा में पुनः पधारे !!</DECL>"""
xmltodict.parse(xml, process_namespaces=True)

Results in:

OrderedDict([('DECL', '!! आप की सेवा में पुनः पधारे !!')]) 

Using a file that contains only that single line, I am able to parse it properly with:

import xmltodict
input_xml  = 'tmp.txt'  # This is the source file

with open(input_xml, encoding='utf-8', mode='r') as _file:
    data = _file.read()
    data = xmltodict.parse(data)
    print(data)

The issue is most probably that you are trying to parse it as "ASCII".
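If dropping "ASCII" does not make the error go away, it may help to look at the exact position the parser complains about (line 239, column 40 in your traceback). Below is a rough diagnostic sketch, not part of xmltodict: it uses the standard library's expat parser directly, and find_xml_error is a hypothetical helper name. The second sample document, with a bare "&", is only an assumed example of malformed XML, not your actual data.

```python
import xml.parsers.expat


def find_xml_error(raw: bytes):
    """Return (line, column, offending_line) or None if the bytes parse cleanly."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(raw, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        lines = raw.splitlines()
        # err.lineno is 1-based; guard against an out-of-range value.
        bad_line = lines[err.lineno - 1] if 0 < err.lineno <= len(lines) else b""
        return err.lineno, err.offset, bad_line
    return None


# The Hindi line from the question is well-formed once encoded as UTF-8.
good = "<DECL>!! आप की सेवा में पुनः पधारे !!</DECL>".encode("utf-8")

# Assumed example of a malformed document: an unescaped ampersand.
bad = b"<root>\n<DECL>a & b</DECL>\n</root>"

print(find_xml_error(good))
print(find_xml_error(bad))
```

For a real file, read it in binary mode (open(input_xml, 'rb')) and pass the bytes to the helper, so that you inspect exactly what is on disk at the reported line.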

Upvotes: 1
