TheRajVJain

Reputation: 399

BeautifulSoup html.parser not understanding img tag

Problem: BeautifulSoup does not treat the img tag as self-closing when I use 'html.parser'.

from bs4 import BeautifulSoup
soup = BeautifulSoup('<img src="" alt="" title="" class=""><span>kjrn</span>', 'html.parser')
print(soup)

gives me

<img alt="" class="" src="" title=""><span>kjrn</span></img>

but I want the result to be

<img alt="" class="" src="" title=""/><span>kjrn</span>

I cannot use the xml parser.

Upvotes: 3

Views: 512

Answers (1)

Zroq

Reputation: 8392

Use the lxml parser instead.

soup = BeautifulSoup('<img src="" alt="" title="" class=""><span>kjrn</span>', 'lxml')
print(soup)

Outputs:

<html><body><img alt="" class="" src="" title=""/><span>kjrn</span></body></html>

lxml and html5lib both try to build a well-formed document, which is why you see the added html and body tags.
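If the added html and body wrappers are unwanted, one possible workaround is to serialize only the children of body. This is a minimal sketch, not something from the original answer: parse_fragment is a hypothetical helper name, and it assumes lxml is installed and that all of the markup ends up inside body (content that a parser moves into head would be dropped).

```python
from bs4 import BeautifulSoup


def parse_fragment(markup, parser="lxml"):
    """Parse markup and serialize it without the <html>/<body> wrappers.

    parse_fragment is a hypothetical helper, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(markup, parser)
    # lxml and html5lib wrap fragments in <html><body>; html.parser does not,
    # so soup.body can be None and the whole tree is serialized instead.
    if soup.body is not None:
        return soup.body.decode_contents()
    return str(soup)


if __name__ == "__main__":
    print(parse_fragment('<img src="" alt="" title="" class=""><span>kjrn</span>'))
```

The fallback to str(soup) keeps the helper usable with 'html.parser', which does not add the wrappers in the first place.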

You can read more about the differences between parsers in the Beautiful Soup documentation, which notes:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.
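To see how the parsers treat a given fragment on your own installation, something like the following could be used. It is only a sketch: compare_parsers is a hypothetical helper, and parsers whose backing library is not installed are skipped rather than reported.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree} for every parser that is installed.

    compare_parsers is a hypothetical helper, not part of Beautiful Soup.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The library backing this parser is not installed; skip it.
            continue
    return results


if __name__ == "__main__":
    markup = '<img src="" alt="" title="" class=""><span>kjrn</span>'
    for name, html in compare_parsers(markup).items():
        print(f"{name}: {html}")
```

The exact output depends on the installed versions of Beautiful Soup and of each parser, so it is worth running this against your own markup.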

Upvotes: 2
