Reputation: 3247
I've read that BeautifulSoup has problems with bare ampersands (&), which are not strictly valid HTML but are still interpreted correctly by most browsers. Strangely, though, I'm getting different behaviour on a Mac and on an Ubuntu system, both using bs4 version 4.3.2:
import bs4
html = '<td>S&P500</td>'
s = bs4.BeautifulSoup(html)
On the Ubuntu system s is equal to:
<td>S&P500;</td>
Notice the added semicolon at the end, which is a real problem.
On the Mac system:
<html><head></head><body>S&P500</body></html>
Never mind the html/head/body tags, I can deal with those, but notice that S&P500 is correctly interpreted this time, without the added ";".
Any idea what's going on? How can I write cross-platform code without resorting to an ugly hack? Thanks a lot.
Upvotes: 0
Views: 727
Reputation: 4129
First, I can't reproduce the Mac results using Python 2.7.1 and beautifulsoup4 4.3.2; that is, I am getting the extra semicolon on all systems.
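One thing worth ruling out, as an assumption on my part rather than something confirmed by the question: BeautifulSoup(html) without a second argument uses whichever parser library it finds installed (lxml, html5lib, or the built-in html.parser), and those handle invalid markup differently. If the two machines have different parsers installed, that alone could produce different results. A minimal sketch that pins the parser so the same code path runs everywhere (parse_cell is just an illustrative helper name):

```python
from bs4 import BeautifulSoup

def parse_cell(markup, parser="html.parser"):
    # Naming the parser explicitly stops bs4 from picking whichever
    # one happens to be installed on the current machine.
    return BeautifulSoup(markup, parser)

soup = parse_cell('<td>S&P500</td>')
print(soup)
```

Passing "lxml" or "html5lib" instead works the same way, provided that library is installed on every machine.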
The easy fix is to a) use strictly valid HTML, or b) add a space after the ampersand. Chances are you can't change the source, and if you could parse out and replace these in Python you wouldn't need BeautifulSoup ;)
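If pre-processing the markup turns out to be acceptable after all, option (a) can be approximated with the standard library alone. This is a rough sketch and not a bs4 feature: escape_bare_ampersands is a made-up helper, it assumes Python 3 (html.entities), and a regex like this can misfire on ampersands inside scripts or comments.

```python
import re
from html.entities import name2codepoint

# An ampersand, optionally followed by something shaped like a
# numeric or named character reference ending in ";".
_AMP = re.compile(r"&(?:#\d+;|#[xX][0-9A-Fa-f]+;|([A-Za-z][A-Za-z0-9]*);)?")

def escape_bare_ampersands(markup):
    def fix(match):
        text, name = match.group(0), match.group(1)
        if text == "&":
            # No reference follows at all: escape the ampersand.
            return "&amp;"
        if name is not None and name not in name2codepoint:
            # Looks like a named reference but the name is unknown.
            return "&amp;" + text[1:]
        # A numeric or known named reference: leave it alone.
        return text
    return _AMP.sub(fix, markup)

cleaned = escape_bare_ampersands('<td>S&P500</td>')
print(cleaned)
```

The cleaned string can then be handed to BeautifulSoup as usual.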
So the problem is that the BeautifulSoupHTMLParser first converts S&P500 to S&P500; because it assumes P500 is the character name and you just forgot the semicolon. Then later it reparses the string and finds &P500;. Now it doesn't recognize P500 as a valid name and converts the & to &amp; without touching the rest.
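To see the first step in isolation, here is a small sketch using the standard library's HTMLParser, which BeautifulSoupHTMLParser subclasses. EventLogger is a throwaway class for illustration and the code assumes Python 3; it only logs callbacks and is not how bs4 builds its tree.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Records parser callbacks instead of building a tree."""

    def __init__(self):
        # convert_charrefs=False keeps entity references as separate events.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

logger = EventLogger()
logger.feed('<td>S&P500</td>')
logger.close()
print(logger.events)

# The parser reports "P500" as an entity name even though no ";" was present.
name = next(value for kind, value in logger.events if kind == "entityref")
old_style = "&%s;" % name  # rebuilds the text with a semicolon added
patched = "&%s" % name     # rebuilds the text exactly as written
```

The callback only receives the bare name, so whatever rebuilds the text has to guess whether a semicolon was there, and adding one unconditionally is where the stray ";" comes from.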
Here is a stupid monkeypatch only to demonstrate my point. I don't know the inner workings of BeautifulSoup well enough to propose a proper solution.
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import BeautifulSoupHTMLParser
from bs4.dammit import EntitySubstitution

def handle_entityref(self, name):
    # Look the name up in bs4's table of known HTML entities.
    character = EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # Previously was
        # data = "&%s;" % name
        data = "&%s" % name
    self.handle_data(data)

html = '<td>S&P500</td>'

# The patch only affects the built-in parser, so request it explicitly.
# Pre monkeypatching (bs4 4.3.2)
# <td>S&amp;P500;</td>
print(BeautifulSoup(html, "html.parser"))

BeautifulSoupHTMLParser.handle_entityref = handle_entityref

# Post monkeypatching
# <td>S&amp;P500</td>
print(BeautifulSoup(html, "html.parser"))
Hopefully someone more versed in bs4 can give you a proper solution, good luck.
Upvotes: 1