How to remove invalid characters when parsing xml using ElementTree (python)

Question

I'm trying to import a folder of ~15,000 XML files into a MongoDB database using Python, specifically ElementTree. About 5% of the files seem to contain an invalid character, mostly &. The document encoding is "ISO-8859-1", and it is declared in the XML files.

Is there a built-in way to either omit the character or automatically convert it to something valid?

Here is the code I'm using so far:

    from pymongo import MongoClient
    import xml.etree.ElementTree as ET
    import os
    import sys


    def get_files(d):
        return [os.path.join(d, f) for f in os.listdir(d) if os.path.isfile(os.path.join(d, f))]

    files = get_files("/path/to/data")

    xmls = []
    for file in files:
        tree = ET.parse(file)
        root = tree.getroot()
        xmls.append(root)


    #Results in:
    In [113]: xmls = []
         ...: for file in files:
         ...:     tree = ET.parse(file)
         ...:     root = tree.getroot()
         ...:     xmls.append(root)
      File "", line unknown
    ParseError: not well-formed (invalid token): line 223, column 74

Sure enough, there is an & on line 223, col 74 of the document that was to be parsed next.

Matthias · Accepted Answer

For closure, here is what I went with:

Instead of using ElementTree, I used lxml with its recover option:

    from lxml import etree

    # recover=True makes lxml try to parse through malformed markup
    parser = etree.XMLParser(ns_clean=True, recover=True)
    xmls = []
    for file in files:
        tree = etree.parse(file, parser=parser)
        root = tree.getroot()
        xmls.append(root)

This does not fix the underlying problem but is sufficient for the task at hand.
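If you would rather stay with ElementTree and address the bare & directly, one possible approach is to escape it before parsing. The following is only a minimal sketch, not something from the original solution: the names `BARE_AMP` and `parse_escaping_ampersands` and the sample document are made up for illustration. It assumes the only problem is an & that does not start an entity or character reference, and that the files contain no CDATA sections or comments where a literal & is legal (those would be altered too). It works on bytes so that the declared ISO-8859-1 encoding is still handled by the parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper pattern: matches "&" that does NOT start something
# shaped like an entity (&name;) or a character reference (&#38; / &#x26;).
BARE_AMP = re.compile(
    rb"&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def parse_escaping_ampersands(raw):
    """Parse XML bytes after replacing every bare '&' with '&amp;'.

    Assumption: bare ampersands are the only well-formedness problem.
    """
    return ET.fromstring(BARE_AMP.sub(b"&amp;", raw))


# Made-up sample in ISO-8859-1 with one bare "&" and two valid references.
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<doc><name>Smith & S\xf6hne</name><ok>a &amp; b &#38; c</ok></doc>"
)

root = parse_escaping_ampersands(sample)
print(root.find("name").text)
print(root.find("ok").text)
```

For files on disk you would read them in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper. Note that a reference-shaped but undefined entity such as `&nbsp;` is left untouched by this pattern and will still make ElementTree raise a ParseError, so wrapping the call in `try`/`except ET.ParseError` and logging the failing file names is probably still worthwhile.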
