Wondercricket
Wondercricket

Reputation: 7882

How do I parse XML that contains HTML entities?

I have a script that takes XML as a string and attempts to parse it using xml

Here is an example of the code I am working with

from xml.etree.ElementTree import fromstring
my_xml = """
    <documents>
          <record>Hello< &O >World</record>
    </documents>
"""
xml = fromstring(my_xml)

When I run the code, I get a ParseError

Traceback (most recent call last):
  File "C:/Code/Python/xml_convert.py", line 7, in <module>
    xml = fromstring(my_xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 18

As stated in Invalid Characters in XML, it is due to having the HTML entities <, >, and &

How may I go about handling these entities so the XML reads them as plain text?

Upvotes: 1

Views: 1455

Answers (2)

Sede
Sede

Reputation: 61293

You can use the lxml Parser with the recover=True flag:

In [25]: import lxml.etree as ET

In [26]: from lxml.etree import XMLParser

In [27]: my_xml = """
   ....:     <documents>
   ....:           <record>Hello< &O >World</record>
   ....:     </documents>
   ....: """

In [28]: parser = XMLParser(recover=True)

In [29]: element = ET.fromstring(my_xml, parser=parser)

In [30]: for text in element.itertext():
   ....:     print(text)
   ....:     


Hello  >World

Upvotes: 4

Elliotte Rusty Harold
Elliotte Rusty Harold

Reputation: 991

You can't do what you're asking. Your document is not well-formed XML, and any conformant XML parser will reject it.

You could write code that used regular expressions to fix it up and make it XML, but any such solution would almost certainly be buggy and error-prone and cause more problems than it solves.

If you really, really can't fix this at the source so the documents are well-formed, then your best option may be to fix them up by hand with human intelligence.

Upvotes: 0

Related Questions