Reputation: 4783
I have an HTML page with part of its tree that looks something like this (see the code snippet below containing the html):
                 <body>
                 |    |
                 |    |
<div id="Kentucky">  <div id="NewOrleans">
        |                  |
        |                  |
     Bourbon            Bourbon
Why is BeautifulSoup indicating that the "left" Bourbon is a child of both "Kentucky" (correct) and "NewOrleans" (incorrect)?
And vice-versa, that the right Bourbon is a child of "Kentucky" (incorrect).
It is not uncommon for different HTML elements on a page to have identical text (e.g. in the header and the footer). But after I do a find_all() for some text pattern, I cannot trust BeautifulSoup, via header.children or footer.children, to correctly tell me which of the two a matched text element is a child of.
(It's as if, in a company, the Engineering and Marketing departments both claimed that a particular employee belongs to them just because her first name is "Sarah". There could be multiple Sarahs in the company; the first_name attribute is just one of many for that object, and it shouldn't solely determine identity.)
Can something like this be avoided, or, what is another approach to find out an element's correct child?
Note that the MRO of the NavigableString class starts with 'str':
<class 'str'>, <class 'bs4.element.PageElement'>, <class 'object'>
which I guess seems to indicate that the cause of the problem is that BeautifulSoup is using string comparisons to determine equality (or identity match) between elements.
Regardless of whether this is indeed the problem, is there an alternative, or a fix/patch?
Thanks!
Code:
import re
from bs4 import BeautifulSoup
TEST_HTML = """<!doctype html>
<head><title>A title</title></head>
<html>
<body>
<div id="Kentucky">Bourbon</div>
<div id="NewOrleans">Bourbon</div>
</body>
</html>
"""
def test():
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(TEST_HTML, 'html.parser')
    # search for "Bourbon"
    re_pattern = re.compile('bourbon', re.IGNORECASE)
    text_matches = soup.find_all(text=re_pattern)
    # print verbose debug output...
    for text_match in text_matches:
        print('id: {} - class: {} - text: {} - parent attrs: {}'.format(
            id(text_match),
            text_match.__class__.__name__,
            text_match.string,
            text_match.parent.attrs))
    # id: 140609176408136 - class: NavigableString - text: Bourbon - parent attrs: {'id': 'Kentucky'}
    # id: 140609176408376 - class: NavigableString - text: Bourbon - parent attrs: {'id': 'NewOrleans'}
    kentucky_match = text_matches[0]
    kentucky_parent = kentucky_match.parent
    new_orleans_match = text_matches[1]
    new_orleans_parent = new_orleans_match.parent
    # confirm -> all ok...
    print(kentucky_parent.attrs)     # {'id': 'Kentucky'}
    print(new_orleans_parent.attrs)  # {'id': 'NewOrleans'}
    # get a list of all the children for both kentucky and new orleans
    # (this tree traversal is all ok)
    ky_children = [child for child in kentucky_parent.children]
    no_children = [child for child in new_orleans_parent.children]
    # confirm -> all ok...
    print([id(child) for child in ky_children])  # [140609176408136]
    print([id(child) for child in no_children])  # [140609176408376]
    # now, here's the problem!!!
    print(kentucky_match in no_children)     # True -> wrong!!!!!!!
    print(kentucky_match in ky_children)     # True
    print(new_orleans_match in no_children)  # True
    print(new_orleans_match in ky_children)  # True -> wrong!!!!!!!
Upvotes: 1
Views: 444
Reputation: 474031
This is because kentucky_match and new_orleans_match are both instances of the NavigableString class, which is a subclass of the regular Unicode string type (str in Python 3).
ky_children and no_children are both lists of what are, for comparison purposes, plain strings; in your case each is just ['Bourbon']. And 'Bourbon' in ['Bourbon'] always evaluates to True. When an in check is performed on a list, the elements are compared with ==, and NavigableString inherits string equality, so it is the text that gets compared, not the identity of the NavigableString instances.
In other words, your in checks are looking for a string in a list of strings.
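To see the mechanism in isolation, here is a minimal sketch that needs no HTML at all. FakeNavigableString is a hypothetical stand-in, not a bs4 class; the only assumption is that, like NavigableString, it subclasses str and does not override equality.

```python
class FakeNavigableString(str):
    """Hypothetical stand-in: a str subclass, as bs4's NavigableString is."""


left = FakeNavigableString("Bourbon")   # plays the role of kentucky_match
right = FakeNavigableString("Bourbon")  # plays the role of new_orleans_match

no_children = [right]

print(left == right)        # True: inherited str equality compares the text
print(left is right)        # False: two distinct objects
print(left in no_children)  # True: list membership falls back to ==
```

So the tree itself is fine; it is only the membership test that cannot tell the two nodes apart.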
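As a workaround, you can use id() for your in checks. Note that both lists have to hold ids for the comparison to be meaningful:
ky_children = [id(child) for child in kentucky_parent.children]
no_children = [id(child) for child in new_orleans_parent.children]
print(id(kentucky_match) in no_children) # False
print(id(kentucky_match) in ky_children) # True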
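An equivalent way to express the same idea, without building lists of ids, is to test identity directly with the is operator. The sketch below is only illustrative: contains_identity is a hypothetical helper name, and FakeNavigableString is again a stand-in str subclass so that the snippet runs without bs4.

```python
def contains_identity(items, target):
    """Return True only if `target` itself (the same object) is among `items`."""
    return any(item is target for item in items)


class FakeNavigableString(str):
    """Hypothetical stand-in for a str subclass such as NavigableString."""


kentucky_match = FakeNavigableString("Bourbon")
new_orleans_match = FakeNavigableString("Bourbon")
ky_children = [kentucky_match]
no_children = [new_orleans_match]

print(contains_identity(ky_children, kentucky_match))  # True
print(contains_identity(no_children, kentucky_match))  # False
```

With the objects from the question, the same helper could be called as contains_identity(new_orleans_parent.children, kentucky_match). For the narrower question "is this the direct parent?", comparing kentucky_match.parent is new_orleans_parent should give the answer without iterating at all.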
Upvotes: 1