How to prevent BeautifulSoup from self-closing things that look like tags but aren't?

Question

I'm using BeautifulSoup to escape all HTML tags (except for a set of pre-approved tags, such as a) in arbitrary text. However, I only want it to treat something as a tag if it is an actual valid HTML tag. If something looks like a tag but isn't, BeautifulSoup adds a closing tag for it, which I don't want.

Example: If someone enters the text <integer>, my code ends up spitting out <integer></integer> instead of just <integer>.

Here's the code (value is the HTML string and VALID_TAGS is just a list of acceptable tag names).

import cgi
import BeautifulSoup

def escape_invalid_tags(value):  # function name is illustrative
  soup = BeautifulSoup.BeautifulSoup(
    value, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
  # Loop through all the tags. If a tag is not in VALID_TAGS, escape its characters.
  for tag in soup.findAll():
    if tag.name not in VALID_TAGS:
      tag.replaceWith(cgi.escape(str(tag)))
  return soup.renderContents()

Thanks in advance.

Chad · Accepted Answer

Figured this out using html5lib, with this answer as a starting point. Here's a version of what I ended up with. It does the same thing as the BeautifulSoup code I started with above, except that it works properly for the case I described:

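import html5lib
from html5lib import sanitizer, serializer, treebuilders, treewalkers

def sanitize_html(value):  # function name is illustrative
  # Parse the fragment with the sanitizing tokenizer.
  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer, tree=treebuilders.getTreeBuilder("dom"))
  dom_tree = p.parseFragment(value)
  # Walk the resulting DOM tree and serialize it back to a string.
  walker = treewalkers.getTreeWalker("dom")
  stream = walker(dom_tree)
  s = serializer.htmlserializer.HTMLSerializer(quote_attr_values=True)
  return s.render(stream)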
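
For readers who cannot use the html5lib calls above (they date from an older release, and newer releases may organize the sanitizer differently), here is a rough sketch of the same idea using only the standard library's html.parser. This is a different technique from the answer's, not a reproduction of it: it works on the token stream and never builds a tree, so nothing ever adds a closing tag for <integer>. The names TagEscaper and escape_unapproved_tags and the contents of VALID_TAGS are assumptions made for illustration. Comments and doctype declarations are dropped, and attributes on approved tags pass through unfiltered, so treat it as a starting point rather than a full sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; the question's real VALID_TAGS is not shown.
VALID_TAGS = {"a", "b", "i"}


class TagEscaper(HTMLParser):
    """Re-emit the input, escaping every tag whose name is not whitelisted."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit(self, name, raw):
        # Whitelisted tags pass through verbatim; everything else is escaped.
        self.out.append(raw if name in VALID_TAGS else escape(raw))

    def handle_starttag(self, tag, attrs):
        self._emit(tag, self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        self._emit(tag, self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(tag, "</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def escape_unapproved_tags(value):
    parser = TagEscaper()
    parser.feed(value)
    parser.close()
    return "".join(parser.out)
```

With the whitelist above, escape_unapproved_tags('Enter an <integer>, see <a href="/x">docs</a>') returns 'Enter an &lt;integer&gt;, see <a href="/x">docs</a>': the unknown tag is escaped exactly once and no </integer> is invented.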

Thanks to everyone who helped.
