jbkkd

Reputation: 1550

HTML garbled after using BeautifulSoup

I'm downloading a page with urllib2 and loading it into BeautifulSoup:

from bs4 import BeautifulSoup as Soup
import urllib2
baseHTML = 'http://forums.macrumors.com/'
baseForum = 'forumdisplay.php?f=109'
forumHTML = urllib2.urlopen(baseHTML+baseForum).read()
page = Soup(forumHTML)
print forumHTML
print page

When printing forumHTML, all is well and the HTML that gets returned is completely fine.

However, when printing page, the HTML gets garbled at this point:

<a href="showthread.php?t=324487" id="thread_title_324487">iPhone Tips and Tricks thread</a>
<span class="smallf">o n t "   s t y l e = " w h i t e - s p a c e 

As you can see, BeautifulSoup adds a > in the wrong place (right after smallf) for some unknown reason, and the text that follows comes out with a space between every character. Here's the same HTML inside forumHTML:

<a href="showthread.php?t=324487" id="thread_title_324487">iPhone Tips and Tricks thread</a>
<span class="smallfont" style="white-space

Why would this happen? I'm using Python 2.7 on 64-bit Windows, if that matters.

Upvotes: 2

Views: 434

Answers (2)

supita

Reputation: 845

I had a similar problem scraping a Google Places page. No > sign was added, but I had the same issue with spaces being introduced into the HTML, and reinstalling BeautifulSoup didn't fix it :)

Anyway, I went back to the BeautifulSoup4 documentation, read about the different HTML parsers it supports, and tried Python's built-in html.parser:

from bs4 import BeautifulSoup

...

page = BeautifulSoup(markup, "html.parser")

and the problem was solved. If you are having this issue, you probably need to name one of the supported HTML parsers explicitly.
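
For anyone who wants to try this without hitting the network, here is a minimal sketch of the same idea. The markup string is a made-up sample loosely modelled on the snippet in the question, not the real page, and the variable names are only for illustration. It just shows that the parser name is passed as the second argument and that the tags come back intact:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the actual forum page).
markup = (
    '<a href="showthread.php?t=324487" id="thread_title_324487">'
    'iPhone Tips and Tricks thread</a>'
    '<span class="smallfont">sample text</span>'
)

# Name the parser explicitly instead of letting bs4 pick whichever
# one happens to be installed.
page = BeautifulSoup(markup, "html.parser")

link = page.find("a", id="thread_title_324487")
span = page.find("span")

link_text = link.get_text()   # text inside the <a> tag
span_classes = span["class"]  # bs4 returns class as a list of strings

print(link_text)
print(span_classes)
```

If the span's class still comes out truncated with html.parser, the parser is probably not the culprit in your case, so treat this as one thing to rule out rather than a guaranteed fix.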

Upvotes: 1

jbkkd

Reputation: 1550

Having not found a solution for this for a long time, I decided to reinstall BeautifulSoup, and that somehow fixed the problem.

Upvotes: 1
