Naftuli Kay

Reputation: 91590

Reading XML header encoding

I have a number of XML files I'd like to process with a script, converting them from whatever encoding they're in to UTF-8.

Using the code given in this great answer I can do the conversion, but how can I read the encoding given in the XML header?

For example, I have many files which are already in UTF-8, which should be left alone:

<?xml version="1.0" encoding="utf-8"?>

However, I have a lot of files which do need to be converted:

<?xml version="1.0" encoding="windows-1255"?>

How can I detect the XML encoding specified in the headers of these files in Python? Better yet, after I detect the encoding and re-encode the files, how can I change the XML header to read "utf-8" so the files aren't processed again in the future?

Upvotes: 5

Views: 6910

Answers (3)

Mykhailo Seniutovych

Reputation: 3681

I want to expand on @PiotrDobrogost's answer and actually write a class that retrieves XML document encoding:

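from xml.parsers import expat

class XmlParser(object):
    '''Class used to retrieve an XML document's declared encoding.'''

    def get_encoding(self, xml):
        self.__parse(xml)
        return self.encoding

    def __xml_decl_handler(self, version, encoding, standalone):
        # Called by expat when it reaches the <?xml ...?> declaration.
        self.encoding = encoding

    def __parse(self, xml):
        # Reset first, so a document without a declaration yields None
        # instead of raising AttributeError in get_encoding().
        self.encoding = None
        parser = expat.ParserCreate()
        parser.XmlDeclHandler = self.__xml_decl_handler
        # True marks this as the final (and only) chunk of data.
        parser.Parse(xml, True)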

Here is an example of its usage:

xml = """<?xml version='1.0' encoding='iso-8859-1'?>
    <book>
        <title>Title</title>
        <chapter>Chapter 1</chapter>
    </book>"""
parser = XmlParser()
encoding = parser.get_encoding(xml)
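
To also cover the second half of the question (re-encoding and updating the header), here is a minimal sketch using only the standard library. The names `declared_encoding` and `to_utf8` and the regular expression are my own illustrative choices, not part of any library. It assumes the file is passed as raw bytes, that the declared encoding is one Python's codecs know, and that expat can read it (single-byte encodings such as windows-1255 should work; multi-byte ones may not).

```python
import re
from xml.parsers import expat

# Hypothetical helper pattern: matches the encoding pseudo-attribute
# inside an XML declaration at the very start of the document.
_DECL_ENCODING = re.compile(
    r"""\A(<\?xml[^>]*?encoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\2"""
)


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None if absent."""
    found = []

    def on_decl(version, encoding, standalone):
        found.append(encoding)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_decl
    parser.Parse(data, True)
    return found[0] if found else None


def to_utf8(data):
    """Return `data` (bytes) re-encoded as UTF-8 with an updated declaration."""
    encoding = declared_encoding(data)
    # Assumption: a missing declaration means the file is already UTF-8.
    if encoding is None or encoding.lower() in ("utf-8", "utf8"):
        return data
    text = data.decode(encoding)
    # Rewrite only the encoding name, keeping the original quote style.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")
```

Read each file with `open(path, 'rb')`, pass the bytes to `to_utf8`, and write the result back in binary mode. Files that already declare UTF-8 come back unchanged, so running the script twice does no harm.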

Upvotes: 0

Piotr Dobrogost

Reputation: 42405

How can I detect the XML encoding specified in the headers of these files in Python?

A solution by Rob Wolfe using only the standard library:

from xml.parsers import expat

s = """<?xml version='1.0' encoding='iso-8859-1'?>
       <book>
           <title>Title</title>
           <chapter>Chapter 1</chapter>
       </book>"""

class MyParser(object):
    def XmlDecl(self, version, encoding, standalone):
        print("XmlDecl", version, encoding, standalone)

    def Parse(self, data):
        Parser = expat.ParserCreate()
        Parser.XmlDeclHandler = self.XmlDecl
        Parser.Parse(data, 1)

parser = MyParser()
parser.Parse(s)

Upvotes: 2

Martijn Pieters

Reputation: 1121158

Use lxml to do the parsing; you can then access the original encoding with:

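from lxml import etree

def convert_to_utf8(filename):  # wrapper added so that `return` is valid
    # Binary mode lets lxml decode the bytes according to the XML declaration.
    with open(filename, 'rb') as xmlfile:
        tree = etree.parse(xmlfile)
    # Encoding names are case-insensitive ('UTF-8' vs 'utf-8').
    if (tree.docinfo.encoding or '').lower() == 'utf-8':
        # already in correct encoding, abort
        return
    # otherwise, write the tree back out as UTF-8 here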

You can then use lxml to write the file out again in UTF-8.
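
As a rough sketch of that last step, assuming lxml is installed and the file can safely be overwritten in place (the function name `rewrite_as_utf8` is just an illustrative choice):

```python
from lxml import etree


def rewrite_as_utf8(path):
    """Re-save the XML file at `path` as UTF-8; return True if it was rewritten."""
    tree = etree.parse(path)
    # Compare case-insensitively, since 'UTF-8' and 'utf-8' are equivalent.
    if (tree.docinfo.encoding or "").lower() == "utf-8":
        return False
    # xml_declaration=True makes lxml emit a fresh <?xml ...?> header
    # that names the new encoding.
    tree.write(path, encoding="utf-8", xml_declaration=True)
    return True
```

Because the header is regenerated on write, a second run sees UTF-8 in the declaration and skips the file.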

Upvotes: 5
