Reputation: 197
I am using BeautifulSoup (bs4) to scrape an HTML page. It has <ul> lists whose <li> items hold some interesting links (href).
Snippet:
<ul>
<!-- C 1-3 --></p>
<li>
<a href="http://LINK1" target="_blank">Link1 description</a>
</li>
</ul>
<ul>
<!-- E 1-2-3-6 --></p>
<li>
<a href="LINK-2" target="_blank">Link-2 description</a>
</li>
<p><!-- E 4-5 -7-8-9-10-11 --></p>
</ul>
Problem: When I use find_all() to extract all the <ul> elements, I do not get them correctly because of the malformed closing </p>, which has no matching opening <p>. The browser ignores this and renders the page fine, but BS4 messes up the parsing. Has anyone managed to make BS4 ignore malformed tags like this when they are present?
entries = soup.find_all(lambda x: x.name == 'ul')
print(len(entries))
print(entries[0])
1
<ul>
<!-- C 1-3 --></ul>
Upvotes: 3
Views: 1586
Reputation: 3130
I think you should try a more lenient parser for the HTML. For example:
soup = BeautifulSoup(pg, "html5lib")
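The html5lib parser is the most lenient of the parsers Beautiful Soup supports. According to the documentation, its advantages are that it is extremely lenient, parses pages the same way a web browser does, and creates valid HTML5. Its disadvantages are that it is very slow and that it is an external Python dependency (it has to be installed separately, e.g. with pip install html5lib).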
The documentation offers some explanation of the pros and cons of different parsers: https://beautiful-soup-4.readthedocs.org/en/latest/#installing-a-parser
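To see how much the parser choice matters for the snippet in the question, here is a rough sketch that feeds the same markup to each parser and counts the <ul> elements found. The helper name count_lists and the HTML constant are made up for this illustration, and the counts for html.parser and lxml can vary with the installed versions, so no output is shown for them.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The markup from the question, including the stray </p> tags.
HTML = """
<ul>
<!-- C 1-3 --></p>
<li>
<a href="http://LINK1" target="_blank">Link1 description</a>
</li>
</ul>
<ul>
<!-- E 1-2-3-6 --></p>
<li>
<a href="LINK-2" target="_blank">Link-2 description</a>
</li>
<p><!-- E 4-5 -7-8-9-10-11 --></p>
</ul>
"""


def count_lists(markup, parser):
    """Return the number of <ul> elements, or None if the parser is not installed."""
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is missing.
        return None
    return len(soup.find_all("ul"))


if __name__ == "__main__":
    for parser in ("html.parser", "lxml", "html5lib"):
        print(parser, count_lists(HTML, parser))
```

With html5lib installed, both lists should come back with their <li> children intact, because it recovers from the stray </p> the way a browser does.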
Upvotes: 8