Reputation: 91590
I have a number of XML files I'd like to process with a script, converting them from whatever encoding that they're in to UTF-8.
Using the code given in this great answer I can do the conversion, but how can I read the encoding given in the XML header?
For example, I have many files which are already in UTF-8, which should be left alone:
<?xml version="1.0" encoding="utf-8"?>
However, I have a lot of files which do need to be converted:
<?xml version="1.0" encoding="windows-1255"?>
How can I detect the XML encoding specified in the headers of these files in Python? Better, after I detect and reencode the files, how then can I change this XML header to read "utf-8" to avoid processing it in the future?
Upvotes: 5
Views: 6910
Reputation: 3681
I want to expand on @PiotrDobrogost's answer and actually write a class that retrieves XML document encoding:
from xml.parsers import expat
class XmlParser(object):
'''class used to retrive xml documents encoding
'''
def get_encoding(self, xml):
self.__parse(xml)
return self.encoding
def __xml_decl_handler(self, version, encoding, standalone):
self.encoding = encoding
def __parse(self, xml):
parser = expat.ParserCreate()
parser.XmlDeclHandler = self.__xml_decl_handler
parser.Parse(xml)
And here is the example of its usage:
xml = """<?xml version='1.0' encoding='iso-8859-1'?>
<book>
<title>Title</title>
<chapter>Chapter 1</chapter>
</book>"""
parser = XmlParser()
encoding = parser.get_encoding(xml)
Upvotes: 0
Reputation: 42405
How can I detect the XML encoding specified in the headers of these files in Python?
A solution by Rob Wolfe using only the standard library:
from xml.parsers import expat
s = """<?xml version='1.0' encoding='iso-8859-1'?>
<book>
<title>Title</title>
<chapter>Chapter 1</chapter>
</book>"""
class MyParser(object):
def XmlDecl(self, version, encoding, standalone):
print "XmlDecl", version, encoding, standalone
def Parse(self, data):
Parser = expat.ParserCreate()
Parser.XmlDeclHandler = self.XmlDecl
Parser.Parse(data, 1)
parser = MyParser()
parser.Parse(s)
Upvotes: 2
Reputation: 1121158
Use lxml
to do the parsing; you can then access the original encoding with:
from lxml import etree
with open(filename, 'r') as xmlfile:
tree = etree.parse(xmlfile)
if tree.docinfo.encoding == 'utf-8':
# already in correct encoding, abort
return
You can then use lxml
to write the file out again in UTF-8.
Upvotes: 5