Reputation: 31
I want to parse both XML and HTML files in the same project.
I tried this:
from xml.etree import ElementTree as ET
tree = ET.parse(fpath)
html_file = ET.parse(htmlpath)
and got this error:
Traceback (most recent call last):
  File "C:.py", line 55, in <module>
    html_file = ET.parse("htmlpath")
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &nbsp;: line 690, column 78
Upvotes: 0
Views: 115
Reputation: 3271
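&nbsp; is a standard HTML5 entity, but XML predefines only five entities (&amp;, &lt;, &gt;, &quot; and &apos;), so the XML parser reports it as undefined. It may help to convert such entities to their Unicode characters before running the XML parser. In Python 3.4+ you can use html.unescape for that.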
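import re
from html import escape, unescape
# Replace each named entity with its character; escape() re-encodes &, < and > so the text stays valid XML
textXML = re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), text)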
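Put together, the idea might look like the sketch below. The function name parse_with_html_entities and the sample string are made up for illustration; the approach assumes the HTML is otherwise well-formed XML (every tag closed, attributes quoted), since ElementTree will still reject ordinary "tag soup" HTML.

```python
import re
from html import escape, unescape
from xml.etree import ElementTree as ET


def parse_with_html_entities(text):
    """Parse XML-like text that may contain HTML named entities such as &nbsp;.

    Hypothetical helper for illustration; not part of any library.
    """
    # Swap each named entity for its character, then re-escape so that
    # &amp;, &lt; and &gt; are still valid XML after the substitution.
    cleaned = re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), text)
    return ET.fromstring(cleaned)


# Made-up sample input: &nbsp; alone would make ET.fromstring raise ParseError.
sample = "<p>Fish&nbsp;&amp;&nbsp;chips &lt;fresh&gt;</p>"
root = parse_with_html_entities(sample)
print(repr(root.text))
```

To use it on a file, read the file into a string first (for example with open(htmlpath, encoding="utf-8").read(), assuming the file is UTF-8) and pass that string in, rather than calling ET.parse on the path directly.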
Upvotes: 0