user2511874

Parsing a UTF-8 XML file

I want to parse a file with minidom:

import codecs
from xml.dom.minidom import parse
with codecs.open(fname, encoding="utf-8") as xml:
    dom = parse(xml)

This raises a UnicodeEncodeError. The XML file is UTF-8 without a BOM and has

<?xml version="1.0" encoding="utf-8"?>

in the first line.

If I first read the file, call .encode("utf-8") on the result, and pass it to parseString, it works. Is there a way to parse a UTF-8 XML file directly with minidom.parse?

Upvotes: 2

Views: 2872

Answers (1)

Martijn Pieters

Reputation: 1121176

Leave the decoding to the XML parser; it'll detect what codec to use. Open the file without converting to unicode:

from xml.dom.minidom import parse
with open(fname, "rb") as xml:  # binary mode: the parser receives raw bytes
    dom = parse(xml)

Note the use of the standard function open() instead of codecs.open().

This applies to any XML parser: it is the parser's job to determine from the XML declaration which codec to use for the document. If there is no declaration (and no byte order mark), UTF-8 is the default.
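To illustrate, here is a minimal sketch of the parser picking the codec on its own. The sample text and the make_doc helper are made up for this example, and io.BytesIO stands in for a file opened in binary mode. The same parse() call handles a UTF-8 document, an ISO-8859-1 document, and one with no declaration at all:

```python
import io
from xml.dom.minidom import parse

# Hypothetical sample text containing one non-ASCII character (e with acute).
text = u"caf\u00e9"


def make_doc(encoding):
    """Return an XML document as raw bytes, encoded and declared as `encoding`."""
    xml = u'<?xml version="1.0" encoding="%s"?><root>%s</root>' % (encoding, text)
    return xml.encode(encoding)


def parse_bytes(raw):
    """Parse raw bytes through a file-like object, as parse() would read a binary file."""
    dom = parse(io.BytesIO(raw))
    return dom.documentElement.firstChild.data


# The declaration tells the parser how to decode each document.
utf8_value = parse_bytes(make_doc("utf-8"))
latin1_value = parse_bytes(make_doc("iso-8859-1"))

# No declaration and no BOM: the parser assumes UTF-8.
bare_value = parse_bytes((u"<root>%s</root>" % text).encode("utf-8"))
```

In all three cases the parser is handed undecoded bytes, which is what opening the file with plain open() in binary mode gives you.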

Upvotes: 2
