Reputation: 6365
I'm trying to use BeautifulSoup to parse some dirty HTML. One such page is http://f10.5post.com/forums/showthread.php?t=1142017
Two things go wrong. First, the tree is missing a large chunk of the page. Second, tostring(tree) converts tags like <div> on half of the page to HTML entities like &lt;/div&gt;. For instance:
Original:
<div class="smallfont" align="center">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>
tostring(tree) gives:
&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;
Here's my code:
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)
print soup
Thanks
Upvotes: 2
Views: 59
Reputation: 474191
Use beautifulsoup4 with the extremely lenient html5lib parser:
import urllib2
from bs4 import BeautifulSoup # NOTE: importing beautifulsoup4 here
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")
print soup
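The snippet above is Python 2 (urllib2, the print statement). On Python 3, roughly the same thing might look like the sketch below. The helper names pick_parser and parse_page are hypothetical, and falling back to lxml or the built-in html.parser when html5lib is not installed is my own assumption, not part of the original advice; the less lenient parsers may still drop parts of a badly broken page.

```python
import importlib.util
from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen


def pick_parser():
    """Return the most lenient parser name that is installed (hypothetical helper)."""
    # html5lib and lxml are optional third-party packages;
    # "html.parser" ships with Python, so it is always available.
    for name in ("html5lib", "lxml"):
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


def parse_page(url):
    """Fetch url and parse it with the parser chosen by pick_parser()."""
    # Imported here so the module loads even without bs4 installed.
    from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4

    with urlopen(url) as page:
        # BeautifulSoup accepts a file-like object and reads it itself.
        return BeautifulSoup(page, pick_parser())
```

Calling print(parse_page("http://f10.5post.com/forums/showthread.php?t=1142017")) would then stand in for the last three lines of the Python 2 version. Install html5lib (pip install html5lib) to get the parser recommended above.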
Upvotes: 1