Reputation: 2093
I'm using BeautifulSoup's lxml parser to parse some HTML. However, the parsed tree doesn't match the markup as written. For instance, the following code:
import bs4
my_html = '''
<html>
<body>
<B>
<P>
Hello, I am some bolded text
</P>
</B>
</body>
</html>
'''
soup = bs4.BeautifulSoup(my_html, 'lxml')
print(soup.prettify())
will print:
<html>
<body>
<b>
</b>
<p>
Hello, I am some bolded text
</p>
</body>
</html>
You can see that somehow the <B> tag from my_html gets closed off before the <p> tag in the prettified version, even though it should be closed off after the </p>. Any ideas about what might be going on? I'm totally baffled.
Upvotes: 1
Views: 245
Reputation: 597
This is because you can't have a <p> tag inside of a <b> tag, so the parser is trying to fix broken HTML. Using the html5lib parser or Python's built-in html.parser will result in your expected output (I only know this because I just tested it).
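If you want to check this yourself, here is a rough sketch that parses the same markup with each parser and reports whether the <p> is still nested inside the <b>. The helper name p_stays_inside_b is made up for this example, and lxml and html5lib are optional installs, so a missing parser is reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

my_html = "<html><body><B><P>Hello, I am some bolded text</P></B></body></html>"


def p_stays_inside_b(markup, parser):
    """Hypothetical helper: True if <p> is still a descendant of <b> after parsing.

    Returns None when the requested parser is not installed.
    """
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    b_tag = soup.find("b")
    return b_tag is not None and b_tag.find("p") is not None


# Compare the tree each parser builds from the same input.
results = {
    name: p_stays_inside_b(my_html, name)
    for name in ("lxml", "html.parser", "html5lib")
}

for name, nested in results.items():
    print(name, "->", nested)
```

Going by the question and the test above, lxml should be the one that reports the <p> as no longer nested, but the exact repair may depend on your lxml/libxml2 version.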
Upvotes: 1
Reputation: 2387
That's because paragraphs are not allowed inside the <b> tag.
Only tags that accept flow content are allowed as the parent of <p> tags. See here for a list.
However, you can do the reverse; <p> is allowed as the parent of <b> tags. In your case, you can change your raw HTML to something like this:
my_html = '''
<html>
<body>
<p>
<b>
Hello, I am some bolded text
</b>
</p>
</body>
</html>
'''
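As a quick sanity check (a sketch, not something from the original post), you can confirm that the reordered markup survives parsing unchanged by looking at the parent of the <b> tag. The helper name b_parent_is_p is invented for this example, and lxml is skipped if it isn't installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

fixed_html = '''
<html>
<body>
<p>
<b>
Hello, I am some bolded text
</b>
</p>
</body>
</html>
'''


def b_parent_is_p(markup, parser):
    """Hypothetical helper: True if the first <b> tag's direct parent is a <p>.

    Returns None when the requested parser is not installed.
    """
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    b_tag = soup.find("b")
    return b_tag is not None and b_tag.parent.name == "p"


for parser in ("lxml", "html.parser"):
    print(parser, "->", b_parent_is_p(fixed_html, parser))
```

Because <b> inside <p> is valid nesting, the parser has nothing to repair and the tree should match the markup as written.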
Upvotes: 2