Reputation: 399
Problem: BeautifulSoup does not treat the img tag as self-closing when using 'html.parser'.
from bs4 import BeautifulSoup
BeautifulSoup('<img src="" alt="" title="" class=""><span>kjrn</span>', 'html.parser')
gives me
<img alt="" class="" src="" title=""><span>kjrn</span></img>
but I want the result to be
<img alt="" class="" src="" title=""/><span>kjrn</span>
I cannot use the xml parser.
Upvotes: 3
Views: 512
Reputation: 8392
Use lxml instead.
soup = BeautifulSoup('<img src="" alt="" title="" class=""><span>kjrn</span>', 'lxml')
print(soup)
Outputs:
<html><body><img alt="" class="" src="" title=""/><span>kjrn</span></body></html>
lxml and html5lib will attempt to create a well-formed document, which is why you are seeing the html and body tags.
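If the html and body wrappers are unwanted, one option is to return only the contents of the body tag. The following is a minimal sketch, assuming bs4 and lxml are both installed; parse_fragment is a hypothetical helper name, not part of Beautiful Soup, and attribute order in the output may vary between library versions.

```python
from bs4 import BeautifulSoup


def parse_fragment(markup: str) -> str:
    """Parse markup with lxml and return only what ended up inside <body>.

    parse_fragment is a hypothetical helper used for illustration.
    """
    soup = BeautifulSoup(markup, "lxml")
    # lxml wraps a fragment in <html><body>...</body></html>.
    # For empty input there may be no body at all, so guard against None.
    if soup.body is None:
        return ""
    # decode_contents() renders the children of the tag without the tag itself.
    return soup.body.decode_contents()


fragment = parse_fragment('<img src="" alt="" title="" class=""><span>kjrn</span>')
print(fragment)
```

The printed string should contain the img tag followed by the span as siblings, without the surrounding html and body tags.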
Read more about parsers in the Beautiful Soup documentation, which says:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.
Upvotes: 2