BeautifulSoup lxml parser closing tags where it shouldn't be

Question

I'm using BeautifulSoup's lxml parser to parse some HTML, but the parsed output doesn't match the markup as I wrote it. For instance, the following code:

import bs4

my_html = '''




Hello, I am some bolded text




'''

soup = bs4.BeautifulSoup(my_html, 'lxml')
print(soup.prettify())

will print:


 
  
  
  
   Hello, I am some bolded text

You can see that somehow the tag from my_html gets closed off before the

tag in the prettified version, even though it should be closed off after the

. Any ideas about what might be going on? I'm totally baffled.

Greg · Accepted Answer

This is because you can't have a

tag inside of a tag, so the parser is trying to fix broken HTML. Using the html5lib parser or Python's html.parser will result in your expected output (I only know this because I just tested it).
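
If you want to check this yourself, here is a minimal sketch that runs the same snippet through more than one parser and shows how each one serializes it. The original markup did not survive on this page, so SAMPLE is a hypothetical stand-in (a block-level div nested inside a p), and parse_with and compare are illustrative helper names, not part of BeautifulSoup. Parsers that are not installed are reported rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in markup (an assumption, not the asker's original):
# a block-level <div> nested inside a <p>.
SAMPLE = "<p><div>Hello</div></p>"


def parse_with(markup, parser):
    """Return the markup as re-serialized by the named parser."""
    return str(BeautifulSoup(markup, parser))


def compare(markup, parsers=("lxml", "html.parser")):
    """Map each parser name to its output, or None if it is not installed."""
    results = {}
    for parser in parsers:
        try:
            results[parser] = parse_with(markup, parser)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            results[parser] = None
    return results


if __name__ == "__main__":
    for parser, output in compare(SAMPLE).items():
        print(parser, "->", output if output is not None else "(not installed)")
```

With html.parser the nesting should come back exactly as written, while a parser that repairs markup may restructure it. Comparing the outputs side by side makes it easy to see which parser is doing the rewriting.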
