Chad

Reputation: 1864

How to prevent BeautifulSoup from self-closing things that look like tags but aren't?

I'm using BeautifulSoup to escape all HTML tags in arbitrary text, except for a set of pre-approved tags such as a. However, I only want it to escape things that are actual valid HTML tags. If something looks like a tag but isn't one, BeautifulSoup adds a closing tag for it, which I don't want.

Example: If someone enters the text <integer>, my code outputs &lt;integer&gt;&lt;/integer&gt; instead of just &lt;integer&gt;.

Here's the code (value is the HTML string and VALID_TAGS is just a list of acceptable tag names).

import cgi
import BeautifulSoup  # BeautifulSoup 3 (Python 2)

def escape_invalid_tags(value):  # wrapper name is illustrative
  soup = BeautifulSoup.BeautifulSoup(
    value, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
  # Loop through all the tags. Escape any tag that is not in VALID_TAGS.
  for tag in soup.findAll():
    if tag.name not in VALID_TAGS:
      tag.replaceWith(cgi.escape(str(tag)))
  return soup.renderContents()

Thanks in advance.

Upvotes: 2

Views: 1076

Answers (2)

Chad

Reputation: 1864

I figured this out using html5lib, with this answer as a starting point. Here's a version of what I ended up with. It does the same thing as the BeautifulSoup code I started with above, except that it works properly for the <integer> case I described:

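import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer  # html5lib API of that period

def sanitize_html(value):  # wrapper name is illustrative
  # Tokenize with the sanitizer so that disallowed tags come out escaped.
  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer, tree=treebuilders.getTreeBuilder("dom"))
  dom_tree = p.parseFragment(value)
  walker = treewalkers.getTreeWalker("dom")
  stream = walker(dom_tree)
  s = serializer.htmlserializer.HTMLSerializer(quote_attr_values=True)
  return s.render(stream)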

Thanks to everyone who helped.

Upvotes: 1

MK.

Reputation: 34517

You are doing it wrong (tm). BeautifulSoup is not meant to be used like that. Take a look at this instead: http://code.activestate.com/recipes/52281-strip-tags-and-javascript-from-html-page-leaving-o/ This recipe removes invalid tags, whereas it sounds like you want to keep them but escaped. That should be a pretty easy modification.
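
For illustration, here is a rough sketch of that kind of modification. It is not the linked recipe itself: it assumes Python 3 and uses the standard library's html.parser in place of the recipe's parser. The names TagEscaper and escape_unapproved and the contents of VALID_TAGS are made up for the example. Because the parser only reports the tags it actually sees and never builds a tree, nothing gets closed on your behalf.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; substitute your own list of approved tag names.
VALID_TAGS = {"a", "b", "i"}


class TagEscaper(HTMLParser):
    """Re-emit the input, escaping every tag that is not whitelisted."""

    def __init__(self, valid_tags):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events, so they can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.valid_tags = valid_tags
        self.out = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # the start tag exactly as written
        self.out.append(raw if tag in self.valid_tags else escape(raw))

    def handle_startendtag(self, tag, attrs):
        # The default would also emit a separate end tag for <br/>; avoid that.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        raw = "</%s>" % tag
        self.out.append(raw if tag in self.valid_tags else escape(raw))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def escape_unapproved(value, valid_tags=VALID_TAGS):
    # Comments and doctype declarations are silently dropped in this sketch.
    parser = TagEscaper(valid_tags)
    parser.feed(value)
    parser.close()
    return "".join(parser.out)


print(escape_unapproved("<integer>"))  # prints &lt;integer&gt;
```

Note that approved tags are passed through with their attributes untouched, so something like an onclick attribute on an a tag would survive. If the input is untrusted, you would still want to filter attributes separately.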

Upvotes: 0
