eric

Reputation: 1887

Escaping &hellip; with BeautifulSoup

I am currently using BeautifulSoup to scrape some websites, but I have a problem with some specific characters. The code inside UnicodeDammit seems to indicate that these are (again) some Microsoft-invented ones.

I'm using the newest version of BeautifulSoup (3.0.8.1), as I am still using Python 2.5.

The following code illustrates my problem:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('...Baby One More Time (Digital Deluxe Version&hellip;')
print soup

'...Baby One More Time (Digital Deluxe Version&hellip;'

As you can see, the problem is the '…' (&hellip;) character at the end (which your browser probably rendered correctly). Obviously that's not what I am interested in.

It would be nice to have this character's Unicode representation or something similar. Even simply ignoring it would solve my particular problem.

How can I do this with BeautifulSoup?

Upvotes: 1

Views: 2477

Answers (2)

eric

Reputation: 1887

Found the solution myself:

soup = BeautifulSoup('...Baby One More Time (Digital Deluxe Version&hellip;', convertEntities="html")
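
For cases where only the standard library is available, the same conversion can be approximated by hand. The following is a minimal sketch, not what BeautifulSoup does internally: the helper name decode_named_entities is hypothetical, and it only handles named entities that end with a semicolon. It imports the entity table from htmlentitydefs on Python 2 and from html.entities on Python 3.

```python
import re

try:
    from htmlentitydefs import name2codepoint  # Python 2 module name
except ImportError:
    from html.entities import name2codepoint   # Python 3 module name

try:
    _chr = unichr  # Python 2: build a unicode character from a code point
except NameError:
    _chr = chr     # Python 3: chr already returns a unicode character


def decode_named_entities(text):
    """Hypothetical helper: replace known named entities such as &hellip;."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return _chr(name2codepoint[name])
        # Unknown entity names are left untouched.
        return match.group(0)
    return re.sub(r'&([A-Za-z][A-Za-z0-9]*);', replace, text)


decoded = decode_named_entities(
    u'...Baby One More Time (Digital Deluxe Version&hellip;')
```

To ignore the entity instead of converting it, the replace function could return an empty string for known names.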

Upvotes: 2

Gabe

Reputation: 86718

MS may have invented it, but &hellip; is part of HTML 4: http://www.w3.org/TR/REC-html40/sgml/entities.html

Perhaps your Lib/htmlentitydefs.py is missing or out-of-date, as that's what BeautifulSoup uses to convert entities.

If you look at the Python 2.5 source tree you will clearly see it defined on line 126.
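
A quick way to check whether the entity table on a given installation knows about hellip is to look it up directly. This is a minimal sketch using only the standard library; the module is htmlentitydefs on Python 2 and html.entities on Python 3, and the variable names are illustrative.

```python
try:
    import htmlentitydefs as entities  # Python 2 module name
except ImportError:
    import html.entities as entities   # Python 3 module name

# name2codepoint maps entity names to Unicode code points.
# .get() returns None if the entry is missing from the table.
codepoint = entities.name2codepoint.get('hellip')
print(codepoint)  # 8230, i.e. U+2026 HORIZONTAL ELLIPSIS
```

If this prints None, the entity table is likely missing the entry or out of date.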

Upvotes: 1
