Tasos
Tasos

Reputation: 7577

Python parse XML files with HTML content

I use an API to get some XML files but some of them contain HTML tags without escaping them. For example, <br> or <b></b>

I use this code to read them, but the files with the HTML raise an error. I don't have access to change manually all the files. Is there any way to parse the file without losing the HTML tags?

from xml.dom.minidom import parse, parseString

xml = ...#here is the api to receive the xml file
dom = parse(xml)
strings = dom.getElementsByTagName("string")

Upvotes: 3

Views: 3578

Answers (2)

sepulchered
sepulchered

Reputation: 814

If you can use third-party libs I suggest you to use Beautiful Soup it can handle xml as well as html and also it parses broken markup, also providing easy to use api.

Upvotes: 2

Aran-Fey
Aran-Fey

Reputation: 43196

Read the xml file as a string, and fix the malformed tags before you parse it:

import xml.etree.ElementTree as ET

with open(xml) as xml_file: # open the xml file for reading
    text= xml_file.read() # read its contents
text= text.replace('<br>', '<br />') # fix malformed tags
document= ET.fromstring(text) # parse the string
strings= document.findall('string') # find all string elements

Upvotes: 2

Related Questions