pataluc

Reputation: 589

BeautifulSoup4: Ampersand in text

I have a problem using BeautifulSoup4. (I'm quite new to Python and BeautifulSoup, so forgive me if I'm being dumb.)

Why does the following code:

from bs4 import BeautifulSoup

soup_ko = BeautifulSoup('<select><option>foo</option><option>bar & baz</option><option>qux</option></select>')
soup_ok = BeautifulSoup('<select><option>foo</option><option>bar and baz</option><option>qux</option></select>')

print soup_ko.find_all('option')
print soup_ok.find_all('option')

produce the following output:

[<option>foo</option>, <option>bar &amp; baz</option>]
[<option>foo</option>, <option>bar and baz</option>, <option>qux</option>]

I was expecting the same result in both cases: a list of my 3 options. But BeautifulSoup seems to dislike the ampersand in the text. How can I get rid of this and get a correct list without editing my HTML (or by transforming/converting it)?

Thanks,

Edit: This seems to be a 4.2.0 bug. I downloaded both the 4.2.0 and 4.2.1 versions (from http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz and http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.1.tar.gz), unpacked them in my script folder, and changed my code to:

import sys
sys.path.insert(0, "beautifulsoup4-" + sys.argv[1])
from bs4 import BeautifulSoup, __version__

print "Beautiful Soup %s" % __version__
soup_ko = BeautifulSoup('<select><option>foo</option><option>bar & baz</option><option>qux</option></select>')
print soup_ko.find_all('option')

and got these results:

15:24:38 pataluc ~ % python stack.py 4.2.0
Beautiful Soup 4.2.0
[<option>foo</option>, <option>bar &amp; baz</option>]
15:24:41 pataluc ~ % python stack.py 4.2.1
Beautiful Soup 4.2.1
[<option>foo</option>, <option>bar &amp; baz</option>, <option>qux</option>]

So I guess my question is closed. Thanks for your comments, which made me realize it was a version issue.

Upvotes: 1

Views: 3238

Answers (4)

Sergiy Maksymenko

Reputation: 161

This solution works for me. It is taken from here:

print(soup.prettify(formatter=None))

Upvotes: 0

Eli Borodach

Reputation: 597

As written before, & is part of the HTML language, but you can use html.escape before BeautifulSoup and html.unescape afterwards if necessary.
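A minimal sketch of that round trip, using only the standard library's html module (Python 3). The variable names are made up for illustration, and it assumes you are escaping a piece of text rather than a whole document:

```python
import html

# Hypothetical text content of one <option>, containing a bare ampersand.
raw_text = "bar & baz"

# Escape the text so it is safe to place inside markup.
escaped = html.escape(raw_text)

# Reverse the escaping once the text has been pulled back out.
restored = html.unescape(escaped)

print(escaped)   # bar &amp; baz
print(restored)  # bar & baz
```

Note that html.escape also turns < and > into &lt; and &gt;, so running it over a full HTML string would escape the tags themselves. It is probably only useful on the text you insert between tags.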

Upvotes: 1

pataluc

Reputation: 589

As I said in the edit to my original post, it was a bug in BeautifulSoup 4.2.0. I downloaded 4.2.1 and the bug is gone.

Upvotes: 1

kirelagin

Reputation: 13616

& is used in HTML to write so-called HTML entities. E.g., < is a special symbol in HTML because it starts a tag, so you use &lt; instead.

Thus, & itself is also a special symbol, and you should use &amp; for a literal ampersand. Your HTML was invalid and BeautifulSoup fixed it.
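To illustrate, here is a small sketch that parses a valid version of the markup from the question, with the ampersand written as &amp;. It uses Python 3's standard-library html.parser instead of BeautifulSoup, and the OptionCollector class is a made-up helper, not part of any library:

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Hypothetical helper that collects the text of every <option>."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp;
        # are decoded before handle_data is called.
        super().__init__()
        self.in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_option = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current option.
        if self.in_option:
            self.options[-1] += data


valid_html = (
    "<select><option>foo</option>"
    "<option>bar &amp; baz</option>"
    "<option>qux</option></select>"
)

collector = OptionCollector()
collector.feed(valid_html)
collector.close()

print(collector.options)  # ['foo', 'bar & baz', 'qux']
```

The entity in the source comes back as a plain & in the extracted text, which is why a serializer writes it out as &amp; again when it prints the markup.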

Upvotes: 2
