Reputation: 1550
I'm downloading a page with urllib2 and loading it into BeautifulSoup:
from bs4 import BeautifulSoup as Soup
import urllib2
baseHTML = 'http://forums.macrumors.com/'
baseForum = 'forumdisplay.php?f=109'
forumHTML = urllib2.urlopen(baseHTML+baseForum).read()
page = Soup(forumHTML)
print forumHTML
print page
When printing forumHTML, all is well and the HTML that gets returned is completely fine.
However, when printing page, the HTML gets garbled at this point:
<a href="showthread.php?t=324487" id="thread_title_324487">iPhone Tips and Tricks thread</a>
<span class="smallf">o n t " s t y l e = " w h i t e - s p a c e
As you can see, BeautifulSoup adds a > in the wrong place for some unknown reason.
Here's the same HTML inside forumHTML:
<a href="showthread.php?t=324487" id="thread_title_324487">iPhone Tips and Tricks thread</a>
<span class="smallfont" style="white-space
Why would this happen? I'm using Python 2.7 on 64-bit Windows, if that matters.
Upvotes: 2
Views: 434
Reputation: 845
I had a similar problem scraping a Google Places page. No > sign was added, but I had the same issue with spaces being introduced into the HTML, and reinstalling BeautifulSoup didn't fix it :)
Anyway, I went back to the BeautifulSoup4 documentation, read about the different HTML parsers it supports, and tried Python's built-in html.parser:
from bs4 import BeautifulSoup
...
page = BeautifulSoup(markup, "html.parser")
That solved the problem. If you are having this issue, you probably need to switch to another of the supported HTML parsers.
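If it helps to see which parser is mangling the markup, here is a minimal sketch that runs the same snippet through every parser that happens to be installed. The SAMPLE string and the parse_with_each helper are made up for illustration (they are not from the question or from bs4), and the parser names are the ones listed in the BeautifulSoup4 documentation. I have not reproduced the original garbling with it, so treat it as a way to compare parsers, not as a confirmed diagnosis.

```python
from __future__ import print_function  # lets print() work on Python 2.7 as well

from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, loosely modelled on the forum page in the question.
SAMPLE = (
    '<a href="showthread.php?t=324487" id="thread_title_324487">'
    'iPhone Tips and Tricks thread</a>'
    '<span class="smallfont">Sticky</span>'
)


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse markup with each requested parser; return {parser_name: soup}.

    Parsers whose backing library is not installed are skipped.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable.
            continue
    return results


if __name__ == "__main__":
    for name, soup in parse_with_each(SAMPLE).items():
        # Print the <span> as each parser sees it, to spot any differences.
        print(name, "->", soup.find("span", class_="smallfont"))
```

If one parser's output differs from the others, passing a different parser name as the second argument to BeautifulSoup (as above) is the first thing to try.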
Upvotes: 1
Reputation: 1550
After a long time without finding a solution, I decided to reinstall BeautifulSoup, and that somehow fixed the problem.
Upvotes: 1