Reputation: 7577
I use an API to get some XML files but some of them contain HTML tags without escaping them. For example, <br>
or <b></b>
I use this code to read them, but the files with the HTML raise an error. I don't have access to change manually all the files. Is there any way to parse the file without losing the HTML tags?
from xml.dom.minidom import parse, parseString
xml = ...#here is the api to receive the xml file
dom = parse(xml)
strings = dom.getElementsByTagName("string")
Upvotes: 3
Views: 3578
Reputation: 814
If you can use third-party libs I suggest you to use Beautiful Soup it can handle xml as well as html and also it parses broken markup, also providing easy to use api.
Upvotes: 2
Reputation: 43196
Read the xml file as a string, and fix the malformed tags before you parse it:
import xml.etree.ElementTree as ET
with open(xml) as xml_file: # open the xml file for reading
text= xml_file.read() # read its contents
text= text.replace('<br>', '<br />') # fix malformed tags
document= ET.fromstring(text) # parse the string
strings= document.findall('string') # find all string elements
Upvotes: 2