Python parse XML files with HTML content

Question

I use an API to get some XML files, but some of them contain unescaped HTML tags, for example an unclosed `<br>`.

I use this code to read them, but the files that contain HTML raise an error. I can't edit all the files by hand. Is there any way to parse these files without losing the HTML tags?

from xml.dom.minidom import parse, parseString

xml = ...  # the API call that returns the XML file goes here
dom = parse(xml)
strings = dom.getElementsByTagName("string")

Aran-Fey · Accepted Answer

Read the XML file as a string and fix the malformed tags before you parse it:

import xml.etree.ElementTree as ET

with open(xml) as xml_file:  # open the XML file for reading
    text = xml_file.read()  # read its contents
text = text.replace('<br>', '<br />')  # fix malformed tags
document = ET.fromstring(text)  # parse the repaired string
strings = document.findall('string')  # find all string elements
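
If more than one kind of unclosed tag shows up, a single `replace` call per tag gets tedious. Below is a minimal sketch of the same idea using a regular expression. The helper name `close_void_tags` and the tag list `VOID_TAGS` are my own assumptions, not part of any library, and the sketch assumes the tags never appear with a separate closing tag such as `</br>`:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical list of HTML tags that may appear unclosed; extend as needed.
VOID_TAGS = ("br", "hr", "img")

# Matches <br>, <br/>, <br /> and <img src="x">, keeping any attributes.
_VOID_RE = re.compile(
    r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(VOID_TAGS), re.IGNORECASE
)


def close_void_tags(text):
    """Rewrite every listed tag as a self-closing tag so the XML parser accepts it."""
    return _VOID_RE.sub(lambda m: "<%s%s />" % (m.group(1), m.group(2)), text)


sample = "<resources><string>line one<br>line two</string></resources>"
document = ET.fromstring(close_void_tags(sample))

for string in document.findall("string"):
    # The HTML tag survives as a child element of <string>.
    print(ET.tostring(string, encoding="unicode"))
```

The tags are kept as child elements rather than stripped, so serializing each `string` element with `ET.tostring` gives back its content including the HTML. This only repairs unclosed tags; other problems such as unescaped `&` would need their own fix.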
