Josh Gibson

Reputation: 22968

What's the easiest way to escape HTML in Python?

cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?

Upvotes: 190

Views: 183150

Answers (9)

Maciej Ziarko

Reputation: 12124

Python 3.2 introduced a new html module, which is used for escaping reserved characters in HTML markup.

It provides an escape() function:

import html

print(html.escape('x > 2 && x < 7 single quote: \' double quote: "'))
x &gt; 2 &amp;&amp; x &lt; 7 single quote: &#x27; double quote: &quot;
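
As a small sketch of the round trip (assuming Python 3.4 or later, where html.unescape is available), escaping and then unescaping gives back the original string. The variable names are only for illustration:

```python
import html

raw = 'x > 2 && x < 7'
escaped = html.escape(raw)
restored = html.unescape(escaped)  # html.unescape was added in Python 3.4

print(escaped)   # x &gt; 2 &amp;&amp; x &lt; 7
print(restored)  # x > 2 && x < 7
```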

Upvotes: 186

palestamp

Reputation: 111

Not the easiest way, but still straightforward. The main difference from cgi.escape is that it still works properly if your text already contains &amp;. As the comment in the cgi.escape source shows, that function replaces every & unconditionally:

  • cgi.escape version
def escape(s, quote=None):
    '''Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.'''
    s = s.replace("&", "&amp;") # Must be done first!
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s
  • regex version
import re

# Match a special character unless it already starts one of the entities emitted below.
QUOTE_PATTERN = re.compile(r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)""")

def escape(word):
    """
    Replaces special characters <>&"' with HTML-safe sequences,
    leaving already escaped characters alone.
    """
    replace_with = {
        '<': '&lt;',
        '>': '&gt;',
        '&': '&amp;',
        '"': '&quot;',  # should be escaped in attributes
        "'": '&#39;',   # should be escaped in attributes
    }
    return QUOTE_PATTERN.sub(lambda m: replace_with[m.group(0)], word)
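
To illustrate the double-escaping problem this answer is about, here is a minimal sketch using only the standard html module. The helper name escape_once is hypothetical, and the approach (unescape first, then escape) is a swapped-in alternative to the regex, not the author's method. It assumes any entity in the input is already-escaped markup, which is not always true:

```python
import html

def escape_once(text):
    """Escape text for HTML without double-escaping existing entities.

    Hypothetical helper: decodes any entities first, then escapes once.
    Caveat: input meant to display a literal "&amp;" is treated as
    already escaped.
    """
    return html.escape(html.unescape(text))

print(html.escape(html.escape('a & b')))  # a &amp;amp; b
print(escape_once(escape_once('a & b')))  # a &amp; b
print(escape_once('a &amp; b < c'))       # a &amp; b &lt; c
```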

Upvotes: 2

nosklo

Reputation: 223152

html.escape is the correct answer now; before Python 3.2 it used to be cgi.escape. It escapes:

  • < to &lt;
  • > to &gt;
  • & to &amp;

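That is enough for text placed between HTML tags. Attribute values also need their quote characters escaped, which the quote parameter described further down takes care of.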

EDIT: If you have non-ascii chars you also want to escape, for inclusion in another encoded document that uses a different encoding, like Craig says, just use:

data.encode('ascii', 'xmlcharrefreplace')

Don't forget to decode data to unicode first, using whatever encoding it was encoded with.

However in my experience that kind of encoding is useless if you just work with unicode all the time from start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).

Example:

>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'
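
In Python 3 the same idea would look roughly like the sketch below, using html.escape in place of cgi.escape. It escapes the str first and encodes once at the end, either to ASCII with character references or to UTF-8. The variable names are illustrative:

```python
import html

text = '<a>bá</a>'

# Escape markup characters on the str first, then encode once at the end.
ascii_bytes = html.escape(text).encode('ascii', 'xmlcharrefreplace')
utf8_bytes = html.escape(text).encode('utf-8')

print(ascii_bytes)  # b'&lt;a&gt;b&#225;&lt;/a&gt;'
print(utf8_bytes)   # b'&lt;a&gt;b\xc3\xa1&lt;/a&gt;'
```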

Also worth noting (thanks Greg) is the extra quote parameter cgi.escape takes. With it set to True, cgi.escape also escapes double quote chars (") so you can use the resulting value in an XML/HTML attribute.

EDIT: Note that cgi.escape was deprecated in Python 3.2 in favor of html.escape, which does the same except that quote defaults to True and that single quotes (') are escaped as well. cgi.escape was removed entirely in Python 3.8.
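
A short sketch of what the quote parameter changes in html.escape, using a made-up title string placed inside an attribute value:

```python
import html

title = 'Tom & "Jerry\'s" <show>'

# quote=True (the default) also escapes " and ', so the value is safe in an attribute.
attr = '<span title="{}">cartoon</span>'.format(html.escape(title))
print(attr)
# <span title="Tom &amp; &quot;Jerry&#x27;s&quot; &lt;show&gt;">cartoon</span>

# quote=False leaves quotes alone, which is only appropriate for text between tags.
text_only = html.escape(title, quote=False)
print(text_only)
# Tom &amp; "Jerry's" &lt;show&gt;
```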

Upvotes: 213

speedplane

Reputation: 16141

No libraries, pure Python; safely escapes text into HTML text:

text.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;'
        ).replace('\'','&#39;').replace('"','&#34;').encode('ascii', 'xmlcharrefreplace')
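
Wrapped in a function, the same chain might look like the sketch below. The name escape_text is hypothetical. Note that in Python 3 encode() returns bytes, so this version decodes back to str, which is an assumption about what the caller wants:

```python
def escape_text(text):
    """Escape text by hand and return an ASCII-only str (hypothetical helper)."""
    escaped = (
        text.replace('&', '&amp;')   # must run first, or later entities get re-escaped
            .replace('<', '&lt;')
            .replace('>', '&gt;')
            .replace("'", '&#39;')
            .replace('"', '&#34;')
    )
    # encode() returns bytes in Python 3; decode to get a str again.
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(escape_text('café <b>"hi"</b>'))
# caf&#233; &lt;b&gt;&#34;hi&#34;&lt;/b&gt;
```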

Upvotes: 7

scharfmn

Reputation: 3661

For legacy code in Python 2.7, you can do it via BeautifulSoup4:

>>> from bs4.dammit import EntitySubstitution
>>> esub = EntitySubstitution()
>>> esub.substitute_html("r&d")
'r&amp;d'

Upvotes: 1

SuperFamousGuy

Reputation: 1575

If you wish to escape HTML in a URL:

This is probably NOT what the OP wanted (the question doesn't clearly indicate in which context the escaping is meant to be used), but Python's standard library module urllib has a function that percent-encodes special characters so they can be included in a URL safely.

The following is an example (Python 2 syntax):

#!/usr/bin/python
from urllib import quote

x = '+<>^&'
print quote(x) # prints '%2B%3C%3E%5E%26'
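
In Python 3 the function lives in urllib.parse. A minimal sketch of the equivalent call follows. The second example uses the safe parameter, which defaults to '/', to show that slashes are left alone unless you ask otherwise:

```python
from urllib.parse import quote

x = '+<>^&'
print(quote(x))  # %2B%3C%3E%5E%26

# By default '/' is treated as safe; pass safe='' to encode it too.
print(quote('a/b c'))           # a/b%20c
print(quote('a/b c', safe=''))  # a%2Fb%20c
```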


Upvotes: 13

Craig McQueen

Reputation: 43486

cgi.escape should be good to escape HTML in the limited sense of escaping the HTML tags and character entities.

But you might have to also consider encoding issues: if the HTML you want to quote has non-ASCII characters in a particular encoding, then you would also have to take care that you represent those sensibly when quoting. Perhaps you could convert them to entities. Otherwise you should ensure that the correct encoding translations are done between the "source" HTML and the page it's embedded in, to avoid corrupting the non-ASCII characters.

Upvotes: 7

Brian M. Hunt

Reputation: 83858

There is also the excellent markupsafe package.

>>> from markupsafe import Markup, escape
>>> escape("<script>alert(document.cookie);</script>")
Markup(u'&lt;script&gt;alert(document.cookie);&lt;/script&gt;')

The markupsafe package is well engineered, and probably the most versatile and Pythonic way to go about escaping, IMHO, because:

  1. the return value (Markup) is an instance of a class derived from unicode (i.e. isinstance(escape('str'), unicode) == True)
  2. it properly handles unicode input
  3. it works in Python (2.6, 2.7, 3.3, and pypy)
  4. it respects custom methods of objects (i.e. objects with an __html__ method) and template overloads (__html_format__).

Upvotes: 10

JamesThomasMoon

Reputation: 7164

cgi.escape extended

This version improves cgi.escape. It also preserves whitespace and newlines. Returns a unicode string.

import cgi  # Python 2 / early Python 3 only; cgi.escape was removed in Python 3.8

def escape_html(text):
    """escape strings for display in HTML"""
    return cgi.escape(text, quote=True).\
           replace(u'\n', u'<br />').\
           replace(u'\t', u'&emsp;').\
           replace(u'  ', u' &nbsp;')

For example:

>>> escape_html('<foo>\nfoo\t"bar"')
u'&lt;foo&gt;<br />foo&emsp;&quot;bar&quot;'
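
On current Python 3, where cgi.escape is no longer available, the same idea can be sketched with html.escape. This is an adaptation under that assumption, not the original function. Note that html.escape also escapes single quotes:

```python
import html

def escape_html(text):
    """Escape a string for display in HTML, keeping line breaks and indentation visible."""
    return (
        html.escape(text, quote=True)
        .replace('\n', '<br />')
        .replace('\t', '&emsp;')
        .replace('  ', ' &nbsp;')
    )

print(escape_html('<foo>\nfoo\t"bar"'))
# &lt;foo&gt;<br />foo&emsp;&quot;bar&quot;
```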

Upvotes: 1
