Shankar Cabus

Reputation: 9792

Repair HTML in Python: close open tags and remove invalid closing tags

I'm trying to process HTML that has missing or invalid closing tags in Python with HTMLParser:

Input:

<div>
  <p>foo 
</div>
bar</span>

Desired output (close the open tags and add the missing opening tag for a stray closing tag):

<div>
  <p>foo</p>
</div>
<span>bar</span>

Or even (drop any closing tag that does not match the most recently opened tag, then close all open tags at the end):

<div>
  <p>foo bar</p>
</div>

My code only closes the tags that are still open; I can't find a way to edit the HTML from inside the HTMLParser callbacks.

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

singleton_tags = [
  'area','base','br','col','command','embed','hr',
  'img', 'input','link','meta','param','source'
]

class HTMLParser_(HTMLParser):

    def __init__(self, *args, **kwargs):
        HTMLParser.__init__(self, *args, **kwargs)
        self.open_tags = []

    # Handle opening tag
    def handle_starttag(self, tag, attrs):
        if tag not in singleton_tags:
            self.open_tags.append(tag)

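    # Handle closing tag
    def handle_endtag(self, tag):
        # Pop only when the tag matches the innermost open tag; a stray
        # closing tag would otherwise raise IndexError on an empty stack
        # or remove the wrong entry.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()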

def close_tags(text):
    parser = HTMLParser_()

    # Build the stack of open tags
    parser.feed(text)

    # Close the tags still open, innermost first
    text += ''.join('</%s>' % tag for tag in reversed(parser.open_tags))

    return text
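
For reference, one way to edit the markup while HTMLParser runs is to rebuild the document inside the callbacks instead of only tracking the stack. The following is a minimal Python 3 sketch under that assumption; TagRepairer, VOID_TAGS and repair are hypothetical names, not part of the code above. It drops closing tags that have no open counterpart and closes everything still open, which is close to the second desired output but not identical to it.

```python
from html.parser import HTMLParser

# Same list as singleton_tags above: elements that never take a closing tag
VOID_TAGS = {'area', 'base', 'br', 'col', 'command', 'embed', 'hr',
             'img', 'input', 'link', 'meta', 'param', 'source'}


class TagRepairer(HTMLParser):
    """Hypothetical helper: re-emits the markup while balancing tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; intact
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # stack of currently open tags
        self.out = []        # pieces of the repaired document

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the source
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> is copied through unchanged
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close everything opened after `tag`, then `tag` itself
        while True:
            top = self.open_tags.pop()
            self.out.append('</%s>' % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def repair(html):
    parser = TagRepairer()
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the end, innermost first
    closing = ''.join('</%s>' % t for t in reversed(parser.open_tags))
    return ''.join(parser.out) + closing


print(repair('<div>\n  <p>foo \n</div>\nbar</span>'))
```

With the sample input, the unclosed <p> is closed just before </div> and the stray </span> is discarded. Producing the first desired output (inserting a missing <span>) would need extra logic to decide where the opening tag belongs.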

Upvotes: 1

Views: 1957

Answers (1)

Mike Christensen

Reputation: 91696

I would suggest looking into BeautifulSoup. It's hands down the best HTML parser I've used (for any language) and makes working with HTML quite easy in Python.

Its prettify() method might be useful for you: it prints the parsed document back out as indented markup. Check out the section of the documentation titled "Printing a Document".
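
As a rough illustration of that suggestion, here is a minimal sketch that assumes the bs4 package (BeautifulSoup 4) is installed and uses Python's built-in "html.parser" backend. The variable names are only for illustration. As far as I can tell, this backend closes the unclosed <p> and drops the stray </span>; other backends such as lxml or html5lib may repair the same input differently.

```python
from bs4 import BeautifulSoup

# The broken sample from the question
broken = "<div>\n  <p>foo \n</div>\nbar</span>"

# "html.parser" selects Python's built-in parser as the backend
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)      # compact serialization of the repaired tree
pretty = soup.prettify()  # same tree, re-indented with one tag per line

print(pretty)
```

Note that prettify() changes whitespace, so str(soup) is probably the better choice when the original spacing matters.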

Upvotes: 2
