Reputation:
I want to parse a file with minidom:
import codecs
from xml.dom.minidom import parse
with codecs.open(fname, encoding="utf-8") as xml:
    dom = parse(xml)
This raises a UnicodeEncodeError. The XML file is UTF-8 without a BOM and has
<?xml version="1.0" encoding="utf-8"?>
in the first line.
If I first read the file, .encode("utf-8") it, and pass it to parseString, it works. Is there a way to parse a UTF-8 XML file directly with minidom.parse?
Upvotes: 2
Views: 2872
Reputation: 1121176
Leave the decoding to the XML parser; it'll detect what codec to use. Open the file without converting to unicode:
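# Binary mode hands raw bytes to the parser (on Python 3, text mode would decode them first).
with open(fname, "rb") as xml:
    dom = parse(xml)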
Note the use of the standard function open() instead of codecs.open().
This applies to any XML parser: it is the parser's job to determine from the XML declaration which codec to use for the document. If no declaration (and no byte order mark) is present, UTF-8 is the default.
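To make this concrete, here is a minimal sketch, assuming Python 3. The sample document and the temporary file are made up for illustration; only parse() and open() come from the discussion above. It writes a small UTF-8 file without a BOM, then parses it once from a binary file object and once by passing the filename, which minidom.parse also accepts.

```python
import os
import tempfile
from xml.dom.minidom import parse

# Hypothetical sample document containing a non-ASCII character.
xml_text = '<?xml version="1.0" encoding="utf-8"?>\n<root><name>caf\u00e9</name></root>'

# Save it as UTF-8 without a BOM, like the file in the question.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_text.encode("utf-8"))
    fname = tmp.name

try:
    # Binary mode: the parser receives raw bytes and decodes them itself.
    with open(fname, "rb") as xml:
        dom_from_file = parse(xml)

    # parse() also accepts a filename and opens the file on its own.
    dom_from_name = parse(fname)
finally:
    os.remove(fname)

# Text content of the <name> element, as decoded by the parser.
name_from_file = dom_from_file.getElementsByTagName("name")[0].firstChild.data
name_from_name = dom_from_name.getElementsByTagName("name")[0].firstChild.data
```

In both cases the text node comes back as a decoded string, so no manual .encode() or .decode() step is needed.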
Upvotes: 2