Reputation: 589
I have a problem using BeautifulSoup4. (I'm quite new to Python and BeautifulSoup, so forgive me if this is a dumb question.)
Why does the following code:
from bs4 import BeautifulSoup
soup_ko = BeautifulSoup('<select><option>foo</option><option>bar & baz</option><option>qux</option></select>')
soup_ok = BeautifulSoup('<select><option>foo</option><option>bar and baz</option><option>qux</option></select>')
print soup_ko.find_all('option')
print soup_ok.find_all('option')
produce the following output:
[<option>foo</option>, <option>bar & baz</option>]
[<option>foo</option>, <option>bar and baz</option>, <option>qux</option>]
I was expecting the same result, a list of my 3 options, but BeautifulSoup seems to dislike the ampersand in the text. How can I work around this and get the correct list without editing my HTML (or by transforming/converting it)?
Thanks,
Edit: This seems to be a 4.2.0 bug. I downloaded both the 4.2.0 and 4.2.1 releases (from http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz and http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.1.tar.gz), unpacked them in my script folder, and changed my code to:
import sys
sys.path.insert(0, "beautifulsoup4-" + sys.argv[1])
from bs4 import BeautifulSoup, __version__
print "Beautiful Soup %s" % __version__
soup_ko = BeautifulSoup('<select><option>foo</option><option>bar & baz</option><option>qux</option></select>')
print soup_ko.find_all('option')
and got the results:
15:24:38 pataluc ~ % python stack.py 4.2.0
Beautiful Soup 4.2.0
[<option>foo</option>, <option>bar & baz</option>]
15:24:41 pataluc ~ % python stack.py 4.2.1
Beautiful Soup 4.2.1
[<option>foo</option>, <option>bar & baz</option>, <option>qux</option>]
So I guess my question is closed. Thanks for the comments, which made me realize it was a version issue.
Upvotes: 1
Views: 3238
Reputation: 161
This solution works for me. It is taken from here:
print(soup.prettify(formatter=None))
Upvotes: 0
Reputation: 597
As written before, & is part of the HTML language, but you can use html.escape before calling BeautifulSoup and html.unescape afterwards if necessary.
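A minimal sketch of that round trip, assuming Python 3 (where the standard-library html module provides both functions). The sample strings are made up for illustration. Note that escaping a complete document also escapes its tags, so this probably only makes sense on text fragments:

```python
import html

# Hypothetical text fragment containing a bare ampersand.
raw_text = "bar & baz"

# html.escape converts &, < and > (and quotes, by default) into entities.
escaped = html.escape(raw_text)
print(escaped)

# html.unescape converts the entities back into plain characters.
restored = html.unescape(escaped)
print(restored)

# Caveat: applied to whole markup, the tags themselves get escaped too.
escaped_markup = html.escape("<option>bar & baz</option>")
print(escaped_markup)
```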
Upvotes: 1
Reputation: 589
As I said in the edit to my question, it was a bug in BeautifulSoup 4.2.0; I downloaded 4.2.1 and the bug is gone.
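If a script has to run where the installed release is not under your control, a guard along these lines might help. This is only a sketch: parse_version and is_affected are hypothetical helpers, the version strings are assumed to be purely numeric, and it assumes 4.2.0 is the only affected release, which is all my test showed. In a real script you would pass bs4.__version__ instead of the literal strings.

```python
def parse_version(version):
    """Turn a dotted version string such as '4.2.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

# Assumption: only 4.2.0 shows the problem, based on the test in the question.
AFFECTED = (4, 2, 0)

def is_affected(version):
    """Return True if the given version string is the release with the bug."""
    return parse_version(version) == AFFECTED

for candidate in ("4.2.0", "4.2.1"):
    print(candidate, "affected" if is_affected(candidate) else "ok")
```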
Upvotes: 1
Reputation: 13616
& is used in HTML to introduce so-called HTML entities. E.g., < is a special symbol in HTML because it starts a tag, so you write &lt; instead.
Thus, & itself is also a special symbol, and you should use &amp; for a literal ampersand. Your HTML was invalid and BeautifulSoup fixed it.
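If you cannot fix the HTML at its source, one possible workaround is to rewrite bare ampersands as &amp; before parsing. The sketch below uses a regular expression rather than anything from BeautifulSoup; escape_bare_ampersands is a hypothetical helper, and the pattern only checks that an ampersand looks like the start of a character reference, not that the entity name actually exists.

```python
import re

# An ampersand that does NOT start a named (&amp;), decimal (&#38;)
# or hexadecimal (&#x26;) character reference.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(markup):
    """Replace every bare ampersand with &amp;, leaving existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", markup)

markup = (
    "<select><option>foo</option>"
    "<option>bar & baz</option>"
    "<option>a &amp; b</option></select>"
)
fixed = escape_bare_ampersands(markup)
print(fixed)
```

The already-escaped option is left untouched, so running the function twice gives the same result as running it once.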
Upvotes: 2