Reputation: 1864
I'm using BeautifulSoup to escape all of the HTML tags (except for a set of pre-approved tags, like a) in an arbitrary piece of text. However, I only want it to escape tags that are actual valid HTML tags. If something looks like a tag but isn't, BeautifulSoup ends up adding some HTML to close it off, which I don't want.
Example: If someone enters in the text <integer>, my code ends up spitting out <integer></integer> instead of just <integer>.
Here's the code (value is the HTML string and VALID_TAGS is just a list of acceptable tag names):
soup = BeautifulSoup.BeautifulSoup(
    value, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
# Loop through all the tags. If it is invalid, escape the characters.
for tag in soup.findAll():
    if tag.name not in VALID_TAGS:
        tag.replaceWith(cgi.escape(str(tag)))
return soup.renderContents()
Thanks in advance.
Upvotes: 2
Views: 1076
Reputation: 1864
Figured this out using html5lib, with this answer as a starting point. Here's a version of what I ended up with. It does the same thing as the BeautifulSoup code I started with above, except that it works properly for the <integer> case I described:
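import html5lib
from html5lib import sanitizer, serializer, treebuilders, treewalkers

# Parse the fragment using html5lib's sanitizing tokenizer.
p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                        tree=treebuilders.getTreeBuilder("dom"))
dom_tree = p.parseFragment(value)
# Walk the resulting DOM and serialize it back to an HTML string.
walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)
s = serializer.htmlserializer.HTMLSerializer(quote_attr_values=True)
return s.render(stream)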
Thanks to everyone who helped.
Upvotes: 1
Reputation: 34517
You are doing it wrong (tm). BeautifulSoup is not meant to be used like that. Take a look at this instead: http://code.activestate.com/recipes/52281-strip-tags-and-javascript-from-html-page-leaving-o/ This recipe removes invalid tags, whereas it sounds like you want to keep them in but escaped. It should be a pretty easy modification.
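For what it's worth, here is a rough sketch of what such a modification could look like. It is not the linked recipe itself: it uses Python 3's standard html.parser instead, and the names TagEscaper and escape_invalid_tags, as well as the contents of VALID_TAGS, are made up for illustration. The idea is to re-emit whitelisted tags as written and escape everything else, without ever inventing a closing tag.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; the question does not show its VALID_TAGS.
VALID_TAGS = {"a", "b", "i", "em", "strong"}


class TagEscaper(HTMLParser):
    """Re-emit whitelisted tags verbatim and escape everything else."""

    def __init__(self, valid_tags):
        # convert_charrefs=False delivers entities such as &amp; as
        # separate events, so they can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.valid_tags = set(valid_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # the tag exactly as it was written
        self.parts.append(raw if tag in self.valid_tags else escape(raw))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/>: emit the raw text once.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        raw = "</%s>" % tag  # rebuilt from the (lowercased) tag name
        self.parts.append(raw if tag in self.valid_tags else escape(raw))

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append(escape("<!--%s-->" % data))


def escape_invalid_tags(value, valid_tags=VALID_TAGS):
    parser = TagEscaper(valid_tags)
    parser.feed(value)
    parser.close()
    return "".join(parser.parts)
```

With this, escape_invalid_tags("a <integer> b") gives "a &lt;integer&gt; b", and no closing tag is added. Note the limitations: whitelisted tags keep all of their attributes untouched (so something like onclick would pass through), and doctype declarations and processing instructions are silently dropped, so treat it as a starting point rather than a real sanitizer.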
Upvotes: 0