djh

Reputation: 410

Python xml.etree escaping

When using Python's xml.etree module, how can I escape XML-special characters like '>' and '<' so that they can be used inside a tag? Must I do so manually, or does etree have a method or kwarg that I am missing?

Consider:

In [1]: from xml.etree.ElementTree import Element, SubElement, tostring

In [2]: root = Element('filter')

In [3]: root.set('type', 'test')

In [4]: for op in ['<', '>', '=']:
   ...:     sub_elem = SubElement(root, op)
   ...:     child = Element('a')
   ...:     child.text = 'b'
   ...:     sub_elem.append(child)
   ...:

In [5]: tostring(root)
Out[5]: '<filter type="test"><<><a>b</a></<><>><a>b</a></>><=><a>b</a></=></filter>'

Where I would like to see sections like:

<&lt><a>b</a></&lt>

Upvotes: 1

Views: 4274

Answers (2)

mzjn

Reputation: 50947

Where I would like to see sections like:

<&lt><a>b</a></&lt>

This is not well-formed XML. I guess that you forgot the semicolons, but adding them does not help. The following is also ill-formed:

<&lt;><a>b</a></&lt;>

In the code, you are trying to create elements called <, >, and =. That won't work. All of the following are forbidden in XML element names: <, >, =, &gt;, &lt;.

Unfortunately, ElementTree is a bit lax and allows you to create pseudo-XML, such as this (from the question):

<filter type="test"><<><a>b</a></<><>><a>b</a></>><=><a>b</a></=></filter>
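One way to see that this output is not real XML is to feed it back to the parser. The following is only a minimal sketch, assuming Python 3 (where tostring returns bytes); the variable name round_trips is made up for illustration:

```python
from xml.etree.ElementTree import (
    Element, SubElement, tostring, fromstring, ParseError
)

root = Element('filter')
root.set('type', 'test')
SubElement(root, '<')  # ElementTree does not validate the tag name here

serialized = tostring(root)  # bytes on Python 3

try:
    fromstring(serialized)
    round_trips = True
except ParseError:
    # The parser rejects the document because "<" cannot start an element name.
    round_trips = False

print(round_trips)
```

The serializer accepts the bad tag name, but the parser from the same module should refuse to read the result back.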

If you had used lxml.etree (see http://lxml.de) instead of xml.etree.ElementTree, you would have received an error message: "ValueError: Invalid tag name u'<'".
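A possible way around the problem is to keep the operator out of the element name and store it as an attribute value (or as element text), where ElementTree does escape special characters on serialization. This is a sketch only: the element name 'op' and the attribute name 'symbol' are hypothetical and not something the question asked for.

```python
from xml.etree.ElementTree import Element, SubElement, tostring, fromstring

root = Element('filter')
root.set('type', 'test')

for op in ['<', '>', '=']:
    # 'op' and 'symbol' are illustrative names; the operator is data, not a tag.
    sub_elem = SubElement(root, 'op', {'symbol': op})
    child = SubElement(sub_elem, 'a')
    child.text = 'b'

# encoding='unicode' returns a str instead of bytes on Python 3.
xml_text = tostring(root, encoding='unicode')
print(xml_text)

# The result is well-formed, so it parses, and the operators come back unescaped.
symbols = [e.get('symbol') for e in fromstring(xml_text).findall('op')]
print(symbols)
```

With this layout no manual escaping is needed: the serializer writes &lt; and &gt; in the attribute values, and the parser turns them back into the original characters.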

Upvotes: 1

denvaar

Reputation: 2214

< and > are reserved markup characters in XML; inside text content or attribute values they should be replaced with &lt; and &gt; respectively.

You can use a regular expression to replace the characters that are invalid:

import re

regexp = re.compile(r'<|>')  # here we are making a regex to catch either the character '<' or '>'
replacement_map = {'<': '&lt;', '>': '&gt;'}  # a dict to map a character to the replacement value.
regexp.sub(lambda match: replacement_map[match.group(0)], '<a>hello</a>')  # do the replacement

# output: '&lt;a&gt;hello&lt;/a&gt;'

Though the code is a little more involved, it is an efficient way of doing the replacements.
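As an alternative to a hand-written regular expression, the standard library ships xml.sax.saxutils.escape, which also handles '&' (the regex above does not). A short sketch, with a made-up input string; note that this only helps for text and attribute values, not for element names:

```python
from xml.sax.saxutils import escape, unescape

raw = '<a>hello & goodbye</a>'

# escape() replaces '&', '<' and '>' with their entity references.
escaped = escape(raw)
print(escaped)

# unescape() reverses the substitution.
restored = unescape(escaped)
print(restored == raw)
```

Escaping '&' matters because an unescaped ampersand in text is just as ill-formed as an unescaped '<'.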

Upvotes: 2
