How do I parse XML that contains HTML entities?

Question

I have a script that takes XML as a string and attempts to parse it using xml.etree.ElementTree.

Here is an example of the code I am working with:

from xml.etree.ElementTree import fromstring
my_xml = """
    
          Hello< &O >World
    
"""
xml = fromstring(my_xml)

When I run the code, I get a ParseError:

Traceback (most recent call last):
  File "C:/Code/Python/xml_convert.py", line 7, in <module>
    xml = fromstring(my_xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 18

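As stated in "Invalid Characters in XML", the error is caused by the raw <, >, and & characters in the text, which XML only accepts when they are written as the entities &lt;, &gt;, and &amp;.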

How may I go about handling these entities so the XML reads them as plain text?

Sede · Accepted Answer

You can use lxml's XMLParser with the recover=True flag, which tells the parser to try to continue past malformed input:

In [25]: import lxml.etree as ET

In [26]: from lxml.etree import XMLParser

In [27]: my_xml = """
   ....:     
   ....:           Hello< &O >World
   ....:     
   ....: """

In [28]: parser = XMLParser(recover=True)

In [29]: element = ET.fromstring(my_xml, parser=parser)

In [30]: for text in element.itertext():
   ....:     print(text)
   ....:     


Hello  >World
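
Note that in the output above the recovered text no longer contains the "<" or the "&O": the parser drops what it cannot interpret. If you control the code that builds the XML string (an assumption, since the question does not say where the string comes from), one alternative is to escape the text before embedding it, so the standard-library parser accepts it and returns the original characters. The following is a minimal sketch using xml.sax.saxutils.escape; the tag name "root" and the variable names are hypothetical and chosen only for illustration.

```python
from xml.etree.ElementTree import fromstring
from xml.sax.saxutils import escape

# The text that should end up inside the element, special characters included
raw_text = "Hello< &O >World"

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;
escaped_text = escape(raw_text)

# "root" is a hypothetical tag name used only for this illustration
my_xml = "<root>" + escaped_text + "</root>"

# The standard-library parser now accepts the document and
# converts the entities back to the original characters
element = fromstring(my_xml)
parsed_text = element.text
print(parsed_text)
```

This only helps when the text can be escaped before it is wrapped in markup. For a string that already arrives malformed, a recovering parser such as the one shown above is probably still the more practical route.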
