Kar

Reputation: 6365

Tags are converted to HTML entities?

I'm trying to use BeautifulSoup to parse some dirty HTML. One such page is http://f10.5post.com/forums/showthread.php?t=1142017

Two things go wrong. First, the tree is missing a large chunk of the page. Second, tostring(tree) converts tags such as <div> on half of the page to HTML entities such as &lt;div&gt;. For instance:

Original:

<div class="smallfont" align="center">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>

tostring(tree) gives:

&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;

Here's my code:

from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)

print soup

Thanks

Upvotes: 2

Views: 59

Answers (1)

alecxe

Reputation: 474191

Use beautifulsoup4 together with html5lib, an extremely lenient parser:

import urllib2
from bs4 import BeautifulSoup  # NOTE: importing beautifulsoup4 here

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")

print soup
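
A quick way to check whether switching the parser actually helped is to count entity-escaped tags in the serialized output. The sketch below is not part of the code above: it uses only the Python 3 standard library, the helper name count_escaped_tags is made up for illustration, and the "broken" sample is produced with html.escape to imitate the output shown in the question rather than taken from a real BeautifulSoup run.

```python
import html
import re

# Matches entity-escaped tags such as &lt;div class="x"&gt; or &lt;/span&gt;.
ESCAPED_TAG = re.compile(r"&lt;/?[A-Za-z][^&]*?&gt;")


def count_escaped_tags(serialized):
    """Return how many escaped opening or closing tags appear in the text."""
    return len(ESCAPED_TAG.findall(serialized))


# The footer line quoted in the question.
original = (
    '<div class="smallfont" align="center">All times are GMT -4. '
    'The time now is <span class="time">02:12 PM</span>.</div>'
)

# Assumption: imitate the symptom by escaping the markup as if it were plain text.
broken = html.escape(original, quote=False)

print(count_escaped_tags(original))  # markup intact, nothing escaped
print(count_escaped_tags(broken))    # every tag was turned into entities
```

With a real page, pass the serialized soup (for example str(soup) under Python 3) to count_escaped_tags. A count near zero suggests the markup was parsed as tags rather than as text. Pages that legitimately display escaped HTML, such as code snippets inside posts, are counted too, so treat the number as a hint only.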

Upvotes: 1
