George

Reputation: 2093

BeautifulSoup lxml parser closing tags where it shouldn't be

I'm using BeautifulSoup with the lxml parser to parse some HTML, but the parsed tree doesn't match the markup as written. For instance, the following code:

import bs4

my_html = '''
<html>
<body>
<B>
<P>
Hello, I am some bolded text
</P>
</B>
</body>
</html>
'''

soup = bs4.BeautifulSoup(my_html, 'lxml')
print(soup.prettify())

will print:

<html>
 <body>
  <b>
  </b>
  <p>
   Hello, I am some bolded text
  </p>
 </body>
</html>

You can see that the <B> tag from my_html is closed before the <p> tag in the prettified output, even though it should be closed after the </p>. Any ideas about what might be going on? I'm totally baffled.

Upvotes: 1

Views: 245

Answers (2)

Greg

Reputation: 597

This is because a <p> tag isn't allowed inside a <b> tag, so the parser tries to repair what it considers broken HTML. Using the html5lib parser or Python's built-in html.parser gives your expected output (I only know this because I just tested it).
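One way to check this yourself is to run the same markup through each parser and see whether the <p> ends up nested in the <b>. This is only a rough sketch: the helper name p_inside_b and the results dict are made up for illustration, and it assumes lxml and html5lib may or may not be installed (a missing parser is reported as None rather than raising).

```python
import bs4

my_html = "<html><body><B><P>Hello, I am some bolded text</P></B></body></html>"

def p_inside_b(parser_name):
    """Return True if <p> stays nested in <b>, False if it does not,
    or None if the requested parser is not installed (hypothetical helper)."""
    try:
        soup = bs4.BeautifulSoup(my_html, parser_name)
    except bs4.FeatureNotFound:
        return None
    p = soup.find("p")
    return p is not None and p.find_parent("b") is not None

# Compare the three parsers BeautifulSoup commonly supports.
results = {name: p_inside_b(name) for name in ("lxml", "html.parser", "html5lib")}
for name, nested in results.items():
    print(name, nested)
```

The exact result for lxml may depend on the installed libxml2 version, so treat the printed values as something to verify locally rather than a guarantee.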

Upvotes: 1

Sumit

Reputation: 2387

That's because paragraphs are not allowed inside the <b> tag, which accepts only phrasing content.

Only tags that accept flow content are allowed as the parent of <p> tags. See here for a list.

However, you can do the reverse: <p> is allowed as the parent of <b> tags. In your case, you can change your raw HTML to something like this:

my_html = '''
<html>
<body>
<p>
<b>
Hello, I am some bolded text
</b>
</p>
</body>
</html>
'''
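If you want to confirm that the reordered markup keeps its nesting, a quick check along these lines should do it. It is a sketch only: b_inside_p and fixed_html are names invented for this example, the default parser is html.parser so it runs without extra packages, and the lxml call is guarded in case lxml is not installed.

```python
import bs4

fixed_html = "<html><body><p><b>Hello, I am some bolded text</b></p></body></html>"

def b_inside_p(markup, parser_name="html.parser"):
    """Return True if the first <b> in markup has a <p> ancestor (hypothetical helper)."""
    soup = bs4.BeautifulSoup(markup, parser_name)
    b = soup.find("b")
    return b is not None and b.find_parent("p") is not None

# Check with the built-in parser first.
print(b_inside_p(fixed_html))

# Then with lxml, if it is available.
try:
    print(b_inside_p(fixed_html, "lxml"))
except bs4.FeatureNotFound:
    print("lxml is not installed")
```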

Upvotes: 2
