Handle HTML to remove and close open tags in Python

Question

I'm trying to handle HTML with missing or invalid closing tags in Python using HTMLParser:

Entry:


  foo 

bar

Output: (closing the open tags and adding an opening tag for the unmatched closing tag)


  foo

bar

Or even: (removing closing tags that have no matching opening tag, then closing all tags that are still open)


  foo bar

My code only closes open tags; I can't find a way to edit the HTML from inside the HTMLParser handlers.

from HTMLParser import HTMLParser

singleton_tags = [
  'area','base','br','col','command','embed','hr',
  'img', 'input','link','meta','param','source'
]

class HTMLParser_(HTMLParser):

    def __init__(self, *args, **kwargs):
        HTMLParser.__init__(self, *args, **kwargs)
        self.open_tags = []

    # Handle opening tag
    def handle_starttag(self, tag, attrs):
        if tag not in singleton_tags:
            self.open_tags.append(tag)

    # Handle closing tag
    def handle_endtag(self, tag):
        if tag not in singleton_tags:
            self.open_tags.pop()

def close_tags(text):
    parser = HTMLParser_()

    # Mounts stack of open tags
    parser.feed(text)

    # Closes open tags, innermost first
    text += ''.join('</%s>' % tag for tag in reversed(parser.open_tags))

    return text

Mike Christensen · Accepted Answer

I would suggest looking into BeautifulSoup. It's hands down the best HTML parser I've used (for any language) and makes working with HTML quite easy in Python.

There's a prettify function that might be useful for you. Check out the section titled Printing a Document.
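If adding a dependency isn't an option, the stack idea from the question can also be made to drop stray closing tags by rebuilding the output inside the handlers rather than editing the input. The following is only a rough sketch, assuming Python 3's `html.parser`; the names `TagRepairer` and `repair_tags` and the sample input are made up for illustration, and it makes no attempt at the full HTML error-recovery rules that a real parser such as BeautifulSoup applies.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag (same list as in the question).
VOID_TAGS = {
    'area', 'base', 'br', 'col', 'command', 'embed', 'hr',
    'img', 'input', 'link', 'meta', 'param', 'source',
}


class TagRepairer(HTMLParser):
    """Hypothetical helper: re-emits the input with balanced tags."""

    def __init__(self):
        # convert_charrefs=False so entities are passed to the handlers
        # below and can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # stack of currently open tags
        self.out = []        # pieces of the repaired document

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag as written in the input.
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close everything opened after the matching tag, then the tag itself.
        while True:
            top = self.open_tags.pop()
            self.out.append('</%s>' % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def repair_tags(text):
    parser = TagRepairer()
    parser.feed(text)
    parser.close()
    # Close whatever is still open, innermost first.
    while parser.open_tags:
        parser.out.append('</%s>' % parser.open_tags.pop())
    return ''.join(parser.out)


# Hypothetical input: <p> is never closed and </span> was never opened.
print(repair_tags('<div><p>foo</div>bar</span>'))
# prints: <div><p>foo</p></div>bar
```

The key difference from the code in the question is that the handlers write to an output list, so a closing tag can simply be skipped or extra closing tags can be emitted at the point where they are needed.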
