Reputation: 42033
I'd like to strip all HTML / JavaScript except for:
<b></b>
<ul></ul>
<li></li>
<a></a>
Thanks.
Upvotes: 1
Views: 462
Reputation: 2057
While I agree with Laurence, there are occasions where a quick-and-dirty 99% approach gets the job done without creating other problems.
Here's an example that demonstrates a regex-based approach:
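import re

# Matches any tag: group(1) is the optional closing slash, group(2) the tag body.
CLEANBODY_RE = re.compile(r'<(/?)(.+?)>', re.M)

def _repl(match):
    # The tag name is everything up to the first space in the tag body.
    tag = match.group(2).split(' ')[0]
    if tag == 'p':
        # Keep <p> and </p>, but drop any attributes.
        return '<%sp>' % match.group(1)
    elif tag in ('a', 'br', 'ul', 'li', 'b', 'strong', 'em', 'i'):
        # Allowed tags pass through unchanged, attributes included.
        return match.group(0)
    # Everything else is removed.
    return u''

def cleanbody(html):
    return CLEANBODY_RE.sub(_repl, html)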
Upvotes: 2
Reputation: 143154
Do you want a way that's fast or a way that's correct? A regex-based approach is unlikely to be correct and may open you up to XSS attacks.
You should use an HTML parser like Beautiful Soup or even htmllib.
Also, <a> can contain javascript: hrefs, and there are also the various on* attributes, which are JavaScript. You probably want to strip all of those out. In general, a whitelist approach is best: only keep attributes (and attribute values) you know are safe.
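As a rough illustration of the whitelist idea, here is a minimal sketch. It uses html.parser from the Python 3 standard library rather than Beautiful Soup or htmllib, and the names ALLOWED, SAFE_SCHEMES, WhitelistSanitizer and sanitize are made up for this example. The choice of which URL schemes count as safe is also an assumption, so treat this as a starting point, not a vetted sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
ALLOWED = {"b": set(), "ul": set(), "li": set(), "a": {"href"}}
# Assumption: only absolute links with these prefixes are kept,
# so relative URLs are dropped along with javascript: ones.
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # drop the tag itself; its text content is still kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # drops on* handlers, style, and anything unlisted
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unlisted schemes
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives unescaped, so escape it again on the way out.
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling sanitize() on a string returns the cleaned markup: disallowed tags disappear but their text stays, and script/style bodies are dropped entirely. For anything exposed to untrusted input, a maintained sanitizing library is probably a safer bet than a hand-rolled class like this one.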
Upvotes: 4
Reputation: 126
Replace the elements you want to keep with placeholder values, then regex out any remaining <.*>. Finally, replace the placeholders with the corresponding HTML elements.
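The three steps could be sketched roughly like this. The names KEEP and strip_with_placeholders are invented for the example, and a few details go beyond the description above: the removal pattern is <[^>]*> instead of the greedy <.*> (which would eat everything between the first < and the last > on a line), attributes on kept tags are discarded, and leftover angle brackets are escaped before the placeholders are restored so a half-open tag cannot swallow a restored one.

```python
import re
import uuid

# Tags to keep, taken from the question.
KEEP = ("b", "ul", "li", "a")


def strip_with_placeholders(html):
    # Random marker that is very unlikely to occur in the input.
    token = uuid.uuid4().hex
    names = "|".join(KEEP)

    # 1. Swap allowed tags (with or without attributes) for placeholders.
    #    Attributes are dropped here, so href values do not survive.
    keep_re = re.compile(r"<(/?)(%s)(?:\s[^>]*)?>" % names, re.I)
    text = keep_re.sub(
        lambda m: "%s:%s%s:" % (token, m.group(1), m.group(2).lower()), html
    )

    # 2. Remove every remaining tag, then escape any stray angle brackets.
    text = re.sub(r"<[^>]*>", "", text)
    text = text.replace("<", "&lt;").replace(">", "&gt;")

    # 3. Turn the placeholders back into bare tags.
    restore_re = re.compile(r"%s:(/?)(%s):" % (token, names))
    return restore_re.sub(r"<\1\2>", text)
```

Note that the text inside removed elements (including script bodies) stays in the output as plain text, and the same caveats about regex-based stripping raised in the other answers apply here too.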
Upvotes: 0