Reputation: 9792
I'm trying to handle HTML with missing or invalid closing tags in Python, using HTMLParser:
Input:
<div>
<p>foo
</div>
bar</span>
Desired output (close the tags left open and add an opening tag for the stray closing tag):
<div>
<p>foo</p>
</div>
<span>bar</span>
Or even (drop any closing tag that does not match the most recently opened tag, then close all open tags at the end):
<div>
<p>foo bar</p>
</div>
My code only closes the tags that are still open; I can't find a way to edit the HTML from inside the HTMLParser callbacks.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

singleton_tags = [
    'area', 'base', 'br', 'col', 'command', 'embed', 'hr',
    'img', 'input', 'link', 'meta', 'param', 'source'
]

class HTMLParser_(HTMLParser):
    def __init__(self, *args, **kwargs):
        HTMLParser.__init__(self, *args, **kwargs)
        self.open_tags = []

    # Handle opening tag
    def handle_starttag(self, tag, attrs):
        if tag not in singleton_tags:
            self.open_tags.append(tag)

    # Handle closing tag (assumes it matches the most recently opened tag)
    def handle_endtag(self, tag):
        if tag not in singleton_tags and self.open_tags:
            self.open_tags.pop()

def close_tags(text):
    parser = HTMLParser_()
    # Build the stack of open tags
    parser.feed(text)
    # Close the tags that are still open, innermost first
    text += ''.join('</%s>' % tag for tag in reversed(parser.open_tags))
    return text
Upvotes: 1
Views: 1957
Reputation: 91696
I would suggest looking into BeautifulSoup. It's hands down the best HTML parser I've used (for any language) and makes working with HTML quite easy in Python.
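As a rough sketch of what that looks like on the input from the question: this assumes the current `bs4` package (installed as `beautifulsoup4`) and its `html.parser` backend, neither of which is named in the question. Other backends such as lxml or html5lib may repair the same markup differently, so check the result against what you expect.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# The broken input from the question.
broken = '<div>\n<p>foo\n</div>\nbar</span>'

# 'html.parser' selects the standard-library backend (an assumption here);
# lxml or html5lib may repair the same input in a different way.
soup = BeautifulSoup(broken, 'html.parser')

# Serializing the tree writes a closing tag for every element in it.
fixed = str(soup)
print(fixed)
```

Parsing builds a tree, and serializing the tree gives balanced markup, so the unclosed `<p>` comes back closed. How the stray `</span>` is treated depends on the backend.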
There's a prettify function that might be useful for you. Check out the section titled "Printing a Document".
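If an extra dependency is not an option, here is a minimal standard-library sketch of the second variant from the question (drop closing tags that don't match the most recently opened tag, then close everything at the end). Rather than editing the input while parsing, it rebuilds the output inside the callbacks. It assumes Python 3; `TagRepairer`, `repair_tags` and `VOID_TAGS` are names made up for this illustration, and it is not a full HTML5 recovery algorithm.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (list taken from the question).
VOID_TAGS = {
    'area', 'base', 'br', 'col', 'command', 'embed', 'hr',
    'img', 'input', 'link', 'meta', 'param', 'source',
}


class TagRepairer(HTMLParser):
    """Hypothetical helper: re-emits the document with balanced tags."""

    def __init__(self):
        # Leave entities such as &amp; unconverted so they are echoed verbatim.
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # stack of currently open tag names
        self.out = []        # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: echo it, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Keep a closing tag only if it matches the most recently opened tag;
        # anything else is treated as stray and dropped.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def repair_tags(text):
    parser = TagRepairer()
    parser.feed(text)
    parser.close()
    # Close whatever is still open, innermost first.
    while parser.open_tags:
        parser.out.append('</%s>' % parser.open_tags.pop())
    return ''.join(parser.out)


print(repair_tags('<div>\n<p>foo\n</div>\nbar</span>'))
```

On the input from the question, the `</div>` and `</span>` are dropped because neither matches the open `<p>`, and `</p></div>` is appended at the end. The first variant (inserting a missing opening tag) is harder, since the parser has to guess where the opening tag belongs.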
Upvotes: 2