Reputation: 21
The code below checks whether there is more than one opening html tag:
from bs4 import BeautifulSoup
invalid = """<html>
<html>
</html>
</html>"""
soup = BeautifulSoup(invalid, 'html.parser')
print(len(soup.find_all("html")))  # prints 2
valid = """<html>
</html></html>"""
soup = BeautifulSoup(valid, 'html.parser')
print(len(soup.find_all("html")))  # prints 1
But how can I check whether there is more than one closing html tag?
Upvotes: 1
Views: 382
Reputation: 279255
I wouldn't use BeautifulSoup, because it's specifically a tag soup parser. It cleans up mismatched open and close tags for you; that's part of the point.
Instead, use the parser that BeautifulSoup uses. There's a standard one in Python, called HTMLParser in Python 2 and html.parser in Python 3. If you've read the BeautifulSoup documentation you know that others are available, such as lxml or html5lib.
So for example:
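import html.parser

class Parser(html.parser.HTMLParser):
    count = 0

    def handle_endtag(self, tag):
        # Called once for every closing tag the parser encounters
        if tag == 'html':
            self.count += 1

parser = Parser()
# The </html> inside the comment is not reported as an end tag
parser.feed('<html></html><!-- </html> --></html>')
parser.close()
print(parser.count)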
Output:
2
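If you also want to compare opening and closing tags, the same idea extends naturally. The following is only a sketch for Python 3: the names TagCounter and count_tags are my own, and it simply tallies what the parser reports without judging whether the document is valid.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Tallies start and end tags separately, without repairing the markup."""

    def __init__(self):
        super().__init__()
        self.opened = Counter()  # tag name -> number of opening tags seen
        self.closed = Counter()  # tag name -> number of closing tags seen

    def handle_starttag(self, tag, attrs):
        self.opened[tag] += 1

    def handle_endtag(self, tag):
        self.closed[tag] += 1


def count_tags(markup):
    """Hypothetical helper: returns (opened, closed) counters for markup."""
    parser = TagCounter()
    parser.feed(markup)
    parser.close()
    return parser.opened, parser.closed


# The "valid" sample from the question: one opening tag, two closing tags
opened, closed = count_tags("<html>\n</html></html>")
print(opened["html"], closed["html"])  # prints 1 2
```

A document where opened["html"] or closed["html"] is not exactly 1 could then be rejected, depending on how strict you want to be.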
Upvotes: 1