Reputation: 10807
I have a string where special characters like ', " or & (...) can appear. In the string:
string = """ Hello "XYZ" this 'is' a test & so on """
how can I automatically escape every special character, so that I get this:
string = " Hello &quot;XYZ&quot; this &#39;is&#39; a test &amp; so on "
Upvotes: 28
Views: 58910
Reputation: 4749
The other answers here will help with characters such as the ones you listed and a few others. However, if you also want to convert everything else to entity names, you'll have to do something else. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you there. You'll want to do something like this that uses html.entities.entitydefs, which is just a dictionary. (The following code is made for Python 3.x, but there's a partial attempt at making it compatible with 2.x to give you an idea):
# -*- coding: utf-8 -*-
import sys
if sys.version_info[0] > 2:
    from html.entities import entitydefs
else:
    from htmlentitydefs import entitydefs
text = ";\"áèïøæỳ"  # This is your string variable containing the stuff you want to convert.
# $ஸ$ is just a placeholder the user isn't likely to have in the document. Semicolons are converted first so that the semicolons in the entity names added later aren't converted into entity names themselves.
text = text.replace(";", "$ஸ$")
text = text.replace("$ஸ$", "&semi;")  # Converting semicolons to entity names.
# "amp" is skipped below so that the & of entities already inserted isn't escaped again; a literal & in the input is therefore left as-is.
if sys.version_info[0] > 2:  # Using appropriate code for each Python version.
    for k, v in entitydefs.items():
        if k not in {"semi", "amp"}:
            text = text.replace(v, "&" + k + ";")  # You have to add the & and ; manually.
else:
    for k, v in entitydefs.iteritems():
        if k not in {"semi", "amp"}:
            text = text.replace(v, "&" + k + ";")  # You have to add the & and ; manually.
# The above code doesn't cover every single entity name, although I believe it covers everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter:
text = text.replace("ŷ", "&ycirc;")
text = text.replace("Ŷ", "&Ycirc;")
text = text.replace("ŵ", "&wcirc;")
text = text.replace("Ŵ", "&Wcirc;")
text = text.replace("ỳ", "&ygrave;")
text = text.replace("Ỳ", "&Ygrave;")
text = text.replace("ẃ", "&wacute;")
text = text.replace("Ẃ", "&Wacute;")
text = text.replace("ẁ", "&wgrave;")
text = text.replace("Ẁ", "&Wgrave;")
print(text)
# Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&ygrave;
# The Python 2.x version outputs the wrong stuff. So, clearly you'll have to adjust the code somehow for it.
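As a possible alternative to chaining replace() calls, here is a minimal Python 3 sketch that walks the string once and looks each character up in html.entities.codepoint2name (the standard-library mapping from code points to HTML 4 entity names). The function name to_named_entities and the numeric-reference fallback for characters without a name are assumptions made for illustration, not part of the code above.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace every character that has an HTML 4 entity name with &name;."""
    parts = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            parts.append("&" + name + ";")
        elif ord(ch) < 128:
            # Plain ASCII without an entity name (including ' and ;) is kept.
            parts.append(ch)
        else:
            # Assumption: fall back to a decimal character reference
            # when HTML 4 defines no name for the character.
            parts.append("&#%d;" % ord(ch))
    return "".join(parts)


print(to_named_entities('"\u00e1\u00e8" & \u1ef3'))
# &quot;&aacute;&egrave;&quot; &amp; &#7923;
```

Because each input character is examined exactly once, the & and ; of entities that were already produced are never escaped a second time, so no placeholder is needed.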
Upvotes: 2
Reputation: 523774
In Python 3.2 and later, you can use the html.escape function, e.g.
>>> string = """ Hello "XYZ" this 'is' a test & so on """
>>> import html
>>> html.escape(string)
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
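A short sketch of two related details, assuming Python 3.4 or later (html.unescape was added in 3.4): the quote parameter of html.escape controls whether " and ' are escaped, and html.unescape reverses the conversion.

```python
import html

s = """ Hello "XYZ" this 'is' a test & so on """

escaped = html.escape(s)                 # quote=True is the default
text_only = html.escape(s, quote=False)  # only &, < and > are escaped

print(escaped)    #  Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on
print(text_only)  #  Hello "XYZ" this 'is' a test &amp; so on

# html.unescape turns the entities back into the original characters.
print(html.unescape(escaped) == s)  # True
```

Leaving quote=True is the safer choice when the result may end up inside an HTML attribute value.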
For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:
The cgi module that comes with Python has an escape() function:
import cgi
s = cgi.escape( """& < >""" )   # s = "&amp; &lt; &gt;"
However, it doesn't escape characters beyond &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".
Here's a small snippet that will let you escape quotes and apostrophes as well:
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
}
def html_escape(text):
    """Produce entities within text."""
    return "".join(html_escape_table.get(c, c) for c in text)
You can also use escape() from xml.sax.saxutils to escape html. This function should execute faster. The unescape() function of the same module can be passed the same arguments to decode a string.
from xml.sax.saxutils import escape, unescape
# escape() and unescape() take care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;"
}
html_unescape_table = {v: k for k, v in html_escape_table.items()}
def html_escape(text):
    return escape(text, html_escape_table)
def html_unescape(text):
    return unescape(text, html_unescape_table)
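To illustrate how the xml.sax.saxutils approach behaves on the string from the question, here is a small round-trip sketch. The variable names (quote_entities, quote_characters, original, encoded, decoded) are chosen for this example only.

```python
from xml.sax.saxutils import escape, unescape

# Extra mappings passed on top of the built-in handling of &, < and >.
quote_entities = {'"': "&quot;", "'": "&apos;"}
quote_characters = {v: k for k, v in quote_entities.items()}

original = """ Hello "XYZ" this 'is' a test & so on """
encoded = escape(original, quote_entities)
decoded = unescape(encoded, quote_characters)

print(encoded)              #  Hello &quot;XYZ&quot; this &apos;is&apos; a test &amp; so on
print(decoded == original)  # True
```

escape() replaces & before it applies the extra mappings, so the ampersands inside &quot; and &apos; are not escaped a second time.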
Upvotes: 56
Reputation: 376052
A simple string function will do it:
def escape(t):
    """HTML-escape the text in `t`."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )
Other answers in this thread have minor problems: the cgi.escape method for some reason ignores single quotes, and you need to explicitly ask it to do double quotes. The wiki page linked does all five, but uses the XML entity &apos;, which isn't an HTML 4 entity (it was only added in HTML5).
This function escapes all five characters every time, using HTML-standard entities.
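One detail that the chained replace() calls depend on is order: & has to be replaced first, otherwise the ampersands of entities produced by earlier replacements get escaped again. A minimal sketch of the difference, using two hypothetical helper names and only two of the five characters:

```python
def escape_amp_first(t):
    """Escape & before <, as the function above does."""
    return t.replace("&", "&amp;").replace("<", "&lt;")


def escape_amp_last(t):
    """Wrong order, shown only for comparison."""
    return t.replace("<", "&lt;").replace("&", "&amp;")


print(escape_amp_first("a < b & c"))  # a &lt; b &amp; c
print(escape_amp_last("a < b & c"))   # a &amp;lt; b &amp; c
```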
Upvotes: 4
Reputation: 20695
The cgi.escape method will convert special characters to HTML entities:
import cgi
original_string = 'Hello "XYZ" this \'is\' a test & so on '
escaped_string = cgi.escape(original_string, True)
print(original_string)
print(escaped_string)
will result in
Hello "XYZ" this 'is' a test & so on 
Hello &quot;XYZ&quot; this 'is' a test &amp; so on 
The optional second parameter of cgi.escape controls whether double quotes are escaped. By default, they are not. Single quotes are never escaped.
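Note that cgi.escape was deprecated in Python 3.2 and removed in Python 3.8. A minimal compatibility sketch, assuming a wrapper named escape_html (a hypothetical name) is acceptable in your code base:

```python
try:
    from html import escape as _escape  # Python 3.2 and later
except ImportError:
    from cgi import escape as _escape   # Python 2


def escape_html(text, quote=True):
    """Escape &, < and >, plus quotes when quote is true.

    Behaviour differs slightly between the two back ends:
    html.escape also escapes ' (as &#x27;), cgi.escape never does.
    """
    return _escape(text, quote)


print(escape_html('Hello "XYZ" & so on'))  # Hello &quot;XYZ&quot; &amp; so on
print(escape_html("<p>", quote=False))     # &lt;p&gt;
```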
Upvotes: 5