GiantEnemyCrab

Reputation: 38

Python xml.etree.ElementTree, getting HTML entities

I am trying to analyze XML data and ran into an issue with HTML entities when I use the following code:

import xml.etree.ElementTree as ET

tree = ET.parse(my_xml_file)  # my_xml_file: path to the XML file shown below
root = tree.getroot()
for regex_rule in root.findall('.//regex_rule'):
    # .get() turns &lt; into <, but I want to get &lt; as written
    print(regex_rule.get('input'))
    # prints False because ElementTree's get method turns &lt; into <, is that right?
    print(regex_rule.get('input') == r"(?&lt;!\S)hello(?!\S)")

And here are the XML file contents:

<rules>
<regex_rule input="(?&lt;!\S)hello(?!\S)" output="world"/>
</rules>

I would appreciate it if anybody could point me to a way of getting the string exactly as written in the input attribute, without &lt; being converted into <.

Upvotes: 0

Views: 696

Answers (1)

atomicinf

Reputation: 3736

xml.etree.ElementTree is doing exactly the standards-compliant thing: it decodes XML entity references such as &lt; into the characters they stand for. As far as XML is concerned, &lt; inside an attribute value is the character <, and a conforming parser hands it to you as such.
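To make the behaviour concrete, here is a minimal sketch that parses the document from the question inline (using ET.fromstring instead of a file, purely for illustration) and compares the attribute against the decoded form. The variable names are my own.

```python
import xml.etree.ElementTree as ET

# Same document as in the question, inlined as a raw string for illustration.
xml_text = r'''<rules>
<regex_rule input="(?&lt;!\S)hello(?!\S)" output="world"/>
</rules>'''

root = ET.fromstring(xml_text)
value = root.find('.//regex_rule').get('input')

# The parser has already replaced &lt; with <, so compare against the decoded text.
print(value)
print(value == r"(?<!\S)hello(?!\S)")
```

If the attribute is meant to be used as a regular expression, the decoded value is most likely what you want anyway, since (?<!\S) is the lookbehind syntax that the re module expects.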

The preferred course of action if you do need to encode the literal &lt; is to change your input file to use &amp;lt; instead (i.e. we XML-encode the &).
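As a rough sketch of that suggestion, assuming the input file can be edited so the attribute contains &amp;lt;, the parser decodes only the &amp; and leaves the text &lt; in the result. The inline document and variable names here are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical edited input: the & of &lt; is itself encoded as &amp;.
xml_text = r'<rules><regex_rule input="(?&amp;lt;!\S)hello(?!\S)" output="world"/></rules>'

root = ET.fromstring(xml_text)
literal = root.find('regex_rule').get('input')

# &amp; decodes to &, so the four characters "&lt;" survive as literal text.
print(literal)
print(literal == r"(?&lt;!\S)hello(?!\S)")
```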

If you can't change your input file format then you'll probably need to use a different module, or write your own parser: xml.etree.ElementTree translates entities well before you can do anything meaningful with the output.
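If all you need is text with &lt; in it again, one alternative that sidesteps a custom parser is to re-escape the decoded value with xml.sax.saxutils.escape from the standard library. This is only a sketch and an assumption about what you are after: it is not a faithful copy of the source bytes, because escape also rewrites & and >, and it cannot tell whether the file originally said &lt;, &#60;, or something else.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Same document as in the question, inlined for illustration.
xml_text = r'<rules><regex_rule input="(?&lt;!\S)hello(?!\S)" output="world"/></rules>'

root = ET.fromstring(xml_text)
decoded = root.find('regex_rule').get('input')

# escape() turns &, < and > back into &amp;, &lt; and &gt;.
re_escaped = escape(decoded)

print(re_escaped)
print(re_escaped == r"(?&lt;!\S)hello(?!\S)")
```

For this particular attribute the re-escaped string matches what is written in the file, but that is a property of this input rather than a general guarantee.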

Upvotes: 2
