Reputation: 2361
Let me explain my issue with an example:
from bs4 import BeautifulSoup
txt = """
<html>
<body>
<ul>
<li> 1
<li> 2
</ul>
</body>
"""
soup = BeautifulSoup(txt)
print(soup.prettify())
Here is the output of this script:
<html>
<body>
<ul>
<li>
1
<li>
2
</li>
</li>
</ul>
</body>
</html>
As you can see, the li tags in the input HTML were not closed. BeautifulSoup repaired them by nesting the second li inside the first. Is it possible to configure BeautifulSoup to produce this output instead?
<html>
<body>
<ul>
<li>
1
</li>
<li>
2
</li>
</ul>
</body>
</html>
Upvotes: 0
Views: 84
Reputation: 1124968
The 'fixing' is applied by the parser used to load the HTML into the BeautifulSoup object tree.
You can swap in a different parser; each parser repairs broken HTML in its own way. The html.parser option is the only one available without installing additional packages.
I'd use the html5lib parser here; it interprets non-standard HTML the same way a browser would. Alternatively, you can try the lxml parser:
>>> print(BeautifulSoup(txt, 'html5lib').prettify())
<html>
<head>
</head>
<body>
<ul>
<li>
1
</li>
<li>
2
</li>
</ul>
</body>
</html>
>>> print(BeautifulSoup(txt, 'lxml').prettify())
<html>
<body>
<ul>
<li>
1
</li>
<li>
2
</li>
</ul>
</body>
</html>
As you can see, both produce the desired list structure (html5lib also inserts an empty head element, as a browser would).
Only the built-in html.parser exhibits this problem:
>>> print(BeautifulSoup(txt, 'html.parser').prettify())
<html>
<body>
<ul>
<li>
1
<li>
2
</li>
</li>
</ul>
</body>
</html>
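If you want to check which of your installed parsers nests the li tags, here is a rough sketch that runs the same markup through each one and counts the li elements that ended up inside another li. The helper name li_nesting is made up for this illustration; FeatureNotFound is the exception bs4 raises when the requested parser is not installed. Results may vary with the parser and Python versions you have.

```python
from bs4 import BeautifulSoup, FeatureNotFound

txt = """
<html>
<body>
<ul>
<li> 1
<li> 2
</ul>
</body>
"""


def li_nesting(markup, parser):
    """Return (total <li> count, number of <li> nested inside another <li>).

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(markup, parser)
    items = soup.find_all("li")
    # An <li> with an <li> ancestor means the parser nested the items.
    nested = [li for li in items if li.find_parent("li") is not None]
    return len(items), len(nested)


for parser in ("html.parser", "html5lib", "lxml"):
    try:
        total, nested = li_nesting(txt, parser)
    except FeatureNotFound:
        # html5lib and lxml are separate packages and may be missing.
        print(f"{parser}: not installed")
        continue
    print(f"{parser}: {total} <li> tags, {nested} nested inside another <li>")
```

A parser that reports 0 nested items gives you the sibling structure asked for in the question.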
Upvotes: 2