vlk

Reputation: 2769

Python xml.etree.ElementTree parse force encoding

I receive many XML files, and some of them declare the wrong encoding (e.g. the XML header says ISO-8859-1, but all the strings are actually UTF-8, and so on).

I parse them with xml.etree.ElementTree, which also reads the encoding from the XML header (and that is sometimes wrong):

import xml.etree.ElementTree
input_element = xml.etree.ElementTree.parse("input.xml").getroot()

I would like to force a different encoding and ignore the one in the header.

Is there a simple way to do this?

Upvotes: 2

Views: 2774

Answers (1)

Tomalak

Reputation: 338208

If you are sure of the encoding, you can use open() to read the file into a string, and then use ElementTree.fromstring() to parse that string; it returns the root element of the document.

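from xml.etree import ElementTree

# Decode with the encoding you know is correct, not the declared one.
with open("input.xml", encoding="Windows-1252") as fp:
    xml_string = fp.read()
root = ElementTree.fromstring(xml_string)  # returns the root Element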

This ignores the encoding named in the XML declaration, because the parser receives an already-decoded string rather than raw bytes. For normal, compliant XML documents this method is not recommended; use ElementTree.parse('filename') instead.
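
To see the effect without touching real files, here is a minimal sketch using an in-memory sample. The sample document, the helper names decode_then_parse and parse_with_parser_override, and the choice of UTF-8 as the "real" encoding are all assumptions made for illustration. The second helper is an alternative that is not part of the answer above: the standard library's XMLParser accepts an encoding argument that is documented to override the encoding given in the XML file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the declaration claims ISO-8859-1,
# but the bytes are really UTF-8.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<root><name>café</name></root>"
).encode("utf-8")


def decode_then_parse(data: bytes, encoding: str) -> ET.Element:
    """Decode the bytes ourselves, then parse the resulting str.

    Because the parser is handed a str, the declared encoding is not used.
    """
    return ET.fromstring(data.decode(encoding))


def parse_with_parser_override(source, encoding: str) -> ET.Element:
    """Alternative: let XMLParser override the declared encoding.

    `source` may be a filename or a binary file object.
    """
    parser = ET.XMLParser(encoding=encoding)
    return ET.parse(source, parser=parser).getroot()


# Parsing the raw bytes trusts the (wrong) declaration and yields mojibake.
declared = ET.fromstring(raw)
forced_a = decode_then_parse(raw, "utf-8")
forced_b = parse_with_parser_override(io.BytesIO(raw), "utf-8")

print(repr(declared.find("name").text))
print(repr(forced_a.find("name").text))
print(repr(forced_b.find("name").text))
```

Either helper only makes sense when you already know the real encoding; if the files are a mix of correctly and incorrectly declared encodings, you would still need some way to decide per file which one to trust.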

Upvotes: 6
