Jake

Reputation: 145

Beautifulsoup special character parsing error

I am using Beautiful Soup and urllib2 to collect content from the internet. This is the code I am using:

from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen('http://plrplr.com/33717/mp3-player-guide/').read()
soup = BeautifulSoup(html, "lxml")
contents = soup.find('div', {'class': 'entry-content'})
print contents

But I am getting results like this...

<div class="entry-content">
<p>MP3 player, also well known as digital audio player has become a staple of our gadget life. There are many brands of MP3 players on the market today. So, which MP3 player are the most suitable for you? That’s where this MP3 player guide comes in. <br/>
Basically, there are 3 types of MP3 player based on capacity: – <br/>
1. Hard drive MP3 player <br/>
– highest capacity <br/>
– largest in size <br/>
– heavy <br/>
– often labeled as an “Jukebox MP3 player� <br/>
– has moving parts <br/>
– example: Apple iPod video, Sony Network Walkman NW-HD5 <br/>

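There is a problem with how the special characters are handled: they come out as literal characters instead of the entity codes used in the page source.

How can I get the exact source code, like this?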

    <div class="entry-content">
        <p>MP3 player, also well known as digital audio player has become a staple of our gadget life. There are many brands of MP3 players on the market today. So, which MP3 player are the most suitable for you? That&#8217;s where this MP3 player guide comes in. </br><br />
Basically, there are 3 types of MP3 player based on capacity: &#8211; </br><br />
1. Hard drive MP3 player </br><br />
&#8211; highest capacity </br><br />
&#8211; largest in size </br><br />
&#8211; heavy </br><br />
&#8211; often labeled as an &#8220;Jukebox MP3 player&#8221; </br><br />
&#8211; has moving parts </br><br />
&#8211; example: Apple iPod video, Sony Network Walkman NW-HD5 </br><br />

I am running this code on a Windows 8 machine using Eclipse and PyDev.

Upvotes: 2

Views: 2538

Answers (1)

wigy

Reputation: 2222

You are probably looking for contents.prettify(formatter="html"), which outputs HTML entities instead of non-ASCII characters.

I could not test that on my machine, but here are the docs I used: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters
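
One caveat, as far as I can tell from those docs: formatter="html" emits named entities (&rsquo;, &ndash;) where one exists, not the numeric references (&#8217;, &#8211;) shown in the question. If the numeric form matters, here is a minimal sketch that uses only the standard library. It assumes Python 3 and that the markup is already a text string, for example str(contents); to_numeric_refs is a made-up helper name, not part of Beautiful Soup.

```python
def to_numeric_refs(markup):
    """Replace every non-ASCII character with a decimal character reference."""
    # The 'xmlcharrefreplace' error handler turns each character that ASCII
    # cannot encode into &#NNNN;. Decoding again gives back a text string.
    return markup.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical sample standing in for str(contents) from the question.
sample = "<p>That\u2019s where this guide comes in: \u2013 \u201cJukebox\u201d</p>"
print(to_numeric_refs(sample))
```

Plain ASCII markup such as tags and existing entities passes through unchanged, so only the special characters are rewritten.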

Upvotes: 2
