Reputation: 325
I'm trying to import a folder of ~15,000 XML files into a MongoDB database using Python, specifically ElementTree. About 5% of the files seem to contain an invalid character, mostly &. The document encoding is "ISO-8859-1", and it is declared in the XML files.
Is there a built-in way to either omit the character or automatically convert it to something valid?
Here is the code I'm using so far:
from pymongo import MongoClient
import xml.etree.ElementTree as ET
import os
import sys

def get_files(d):
    return [os.path.join(d, f) for f in os.listdir(d) if os.path.isfile(os.path.join(d, f))]

files = get_files("/path/to/data")
xmls = []
for file in files:
    tree = ET.parse(file)
    root = tree.getroot()
    xmls.append(root)
# Results in:
In [113]: xmls = []
     ...: for file in files:
     ...:     tree = ET.parse(file)
     ...:     root = tree.getroot()
     ...:     xmls.append(root)
  File "<string>", line unknown
ParseError: not well-formed (invalid token): line 223, column 74
Sure enough, there is an & on line 223, col 74 of the document that was to be parsed next.
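To see how many files are affected instead of stopping at the first failure, the loop could catch the error per file. This is only a rough sketch: `parse_all` is a made-up helper name, and the two temporary files stand in for the real data folder.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def parse_all(paths):
    """Parse every path; collect failures instead of raising on the first bad file."""
    roots, failures = [], []
    for path in paths:
        try:
            roots.append(ET.parse(path).getroot())
        except ET.ParseError as err:
            # err.position is the (line, column) pair reported by the parser
            failures.append((path, err.position))
    return roots, failures

# Demo data (hypothetical): one well-formed file, one with a bare "&".
header = '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, "good.xml")
    bad = os.path.join(d, "bad.xml")
    with open(good, "w", encoding="iso-8859-1") as f:
        f.write(header + "<doc>a &amp; b</doc>")
    with open(bad, "w", encoding="iso-8859-1") as f:
        f.write(header + "<doc>a & b</doc>")
    roots, failures = parse_all([good, bad])

for path, position in failures:
    print(os.path.basename(path), position)
```

The `failures` list then shows which files need attention and where in each file the parser gave up.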
Upvotes: 2
Views: 5007
Reputation: 325
For closure, here is what I went with:
Instead of using ElementTree, I used lxml with its recover option:
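from lxml import etree

xmls = []
for file in files:
    # recover=True makes lxml try to parse through broken XML instead of raising
    parser = etree.XMLParser(ns_clean=True, recover=True)
    tree = etree.parse(file, parser=parser)
    root = tree.getroot()
    xmls.append(root)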
This does not fix the underlying problem but is sufficient for the task at hand.
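For anyone who does want to tackle the underlying problem while staying with ElementTree, one possible approach is to escape the bare ampersands before parsing. The sketch below is not what I ended up using. `BARE_AMP`, `escape_bare_ampersands` and `parse_lenient` are names made up for illustration, and it assumes the only defect is an unescaped & outside of CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does not start a named, decimal, or hexadecimal entity reference.
BARE_AMP = re.compile(
    rb"&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(raw: bytes) -> bytes:
    """Replace every bare '&' with '&amp;', leaving existing references alone."""
    return BARE_AMP.sub(b"&amp;", raw)

def parse_lenient(raw: bytes) -> ET.Element:
    """Escape bare ampersands, then parse with the standard library parser."""
    return ET.fromstring(escape_bare_ampersands(raw))

# Tiny in-memory sample (hypothetical data) with one bare and one escaped "&".
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<doc><name>Smith & Sons</name><ok>a &amp; b</ok></doc>"
)
root = parse_lenient(sample)
print(root.find("name").text)
```

Working on the raw bytes keeps the declared ISO-8859-1 encoding intact, since & is a single ASCII byte there. For real files, read them with `open(path, "rb")` and pass the bytes to `parse_lenient`. Two caveats: an & inside a CDATA section would be escaped by mistake, and references to undefined entities (for example `&nbsp;`) would still make ElementTree fail.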
Upvotes: 5