Reputation: 7882
I have a script that takes XML as a string and attempts to parse it using xml
Here is an example of the code I am working with
from xml.etree.ElementTree import fromstring
my_xml = """
<documents>
<record>Hello< &O >World</record>
</documents>
"""
xml = fromstring(my_xml)
When I run the code, I get a ParseError
Traceback (most recent call last):
File "C:/Code/Python/xml_convert.py", line 7, in <module>
xml = fromstring(my_xml)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 18
As stated in Invalid Characters in XML, it is due to having the HTML entities <
, >
, and &
How may I go about handling these entities so the XML reads them as plain text?
Upvotes: 1
Views: 1455
Reputation: 61293
You can use the lxml
Parser with the recover=True
flag:
In [25]: import lxml.etree as ET
In [26]: from lxml.etree import XMLParser
In [27]: my_xml = """
....: <documents>
....: <record>Hello< &O >World</record>
....: </documents>
....: """
In [28]: parser = XMLParser(recover=True)
In [29]: element = ET.fromstring(my_xml, parser=parser)
In [30]: for text in element.itertext():
....: print(text)
....:
Hello >World
Upvotes: 4
Reputation: 991
You can't do what you're asking. Your document is not well-formed XML, and any conformant XML parser will reject it.
You could write code that used regular expressions to fix it up and make it XML, but any such solution would almost certainly be buggy and error-prone and cause more problems than it solves.
If you really, really can't fix this at the source so the documents are well-formed, then your best option may be to fix them up by hand with human intelligence.
Upvotes: 0