jd.

Reputation: 4783

BeautifulSoup incorrectly checking child membership for NavigableString elements?

I have an HTML page with part of its tree that looks something like this (see the code snippet below containing the html):

                       <body>
                       |    |
                       |    |
     <div id="Kentucky">    <div id="NewOrleans">
             |                      |
             |                      |
          Bourbon                Bourbon

Why is BeautifulSoup indicating that the "left" Bourbon is a child of both "Kentucky" (correct) and "NewOrleans" (incorrect)?

And vice-versa, that the right Bourbon is a child of "Kentucky" (incorrect).

Having several HTML elements throughout a page with identical text is not uncommon (e.g. in the header and the footer). But after I do a find_all() for some text pattern, I cannot trust header.children or footer.children to tell me correctly whether the matched text element is a child of either one.

(It's as if, in a company, the Engineering and Marketing departments both claimed that a particular employee belongs to them just because her first name is "Sarah". There could be multiple Sarahs in the company; the first_name attribute is just one of many for that object, and it shouldn't solely determine identity.)

Can this be avoided, or is there another approach to determine whether an element really is a child of a given parent?

Note that the MRO of the NavigableString class, after the class itself, starts with 'str':

<class 'str'>, <class 'bs4.element.PageElement'>, <class 'object'>

which suggests that the cause of the problem is that string comparison is being used to determine equality (rather than identity) between elements.

Regardless of whether this is indeed the problem, is there an alternative, or a fix/patch?

Thanks!

Code:

import re
from bs4 import BeautifulSoup

TEST_HTML = """<!doctype html>
<head><title>A title</title></head>
<html>
   <body>
      <div id="Kentucky">Bourbon</div>
      <div id="NewOrleans">Bourbon</div>
   </body>
</html>
"""

def test():
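    # name the parser explicitly to avoid the "no parser was explicitly specified" warning
    soup = BeautifulSoup(TEST_HTML, 'html.parser')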

    # search for "Bourbon"
    re_pattern = re.compile('bourbon', re.IGNORECASE)
    text_matches = soup.find_all(text=re_pattern)

    # print verbose debug output...
    for text_match in text_matches:
        print('id: {} - class: {} - text: {} - parent attrs: {}'.format(
            id(text_match),
            text_match.__class__.__name__,
            text_match.string,
            text_match.parent.attrs))
    # id: 140609176408136 - class: NavigableString - text: Bourbon - parent attrs: {'id': 'Kentucky'}
    # id: 140609176408376 - class: NavigableString - text: Bourbon - parent attrs: {'id': 'NewOrleans'}


    kentucky_match = text_matches[0]
    kentucky_parent = kentucky_match.parent

    new_orleans_match = text_matches[1]
    new_orleans_parent = new_orleans_match.parent

    # confirm -> all ok...
    print(kentucky_parent.attrs)      # {'id': 'Kentucky'}
    print(new_orleans_parent.attrs)   # {'id': 'NewOrleans'}

    # get a list of all the children for both kentucky and new orleans
    # (this tree traversal is all ok)
    ky_children = [child for child in kentucky_parent.children]
    no_children = [child for child in new_orleans_parent.children]

    # confirm -> all ok...
    print([id(child) for child in ky_children])   # [140609176408136]
    print([id(child) for child in no_children])   # [140609176408376]


    # now, here's the problem!!!
    print(kentucky_match in no_children)      # True  -> wrong!!!!!!!
    print(kentucky_match in ky_children)      # True

    print(new_orleans_match in no_children)   # True
    print(new_orleans_match in ky_children)   # True  -> wrong!!!!!!!

Upvotes: 1

Views: 444

Answers (1)

alecxe

Reputation: 474031

This is because kentucky_match and new_orleans_match are both instances of the NavigableString class, which is a subclass of the regular unicode string type (str in Python 3).

ky_children and no_children are both lists of what are, basically, strings; in your case each one is just [u'Bourbon']. And u'Bourbon' in [u'Bourbon'] always evaluates to True. When the in check is performed, the elements are compared by their string value, not as distinct NavigableString instances.

In other words, your in checks are looking for a string in a list of strings.
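
To see the mechanism in isolation, here is a minimal sketch that does not use BeautifulSoup at all. TextNode and its owner attribute are hypothetical stand-ins for NavigableString and its parent, made up purely for illustration; the only thing they share with the real class is that they subclass str and carry extra state.

```python
class TextNode(str):
    """Hypothetical stand-in for NavigableString: a str subclass with extra state."""

    def __new__(cls, text, owner):
        node = super().__new__(cls, text)
        node.owner = owner  # extra attribute, ignored by str equality
        return node


left = TextNode("Bourbon", "Kentucky")
right = TextNode("Bourbon", "NewOrleans")

print(left == right)    # True: inherited str equality compares only the text
print(left is right)    # False: they are two distinct objects
print(left in [right])  # True: `in` checks identity first, then falls back to ==
```

The last line is the same situation as kentucky_match in no_children: the objects differ, but their text is equal, so the membership test succeeds.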

As a workaround, you can use id() for your in check:

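# build lists of ids for BOTH parents, so that ints are compared with ints
ky_children = [id(child) for child in kentucky_parent.children]
no_children = [id(child) for child in new_orleans_parent.children]
print(id(kentucky_match) in no_children)      # False
print(id(kentucky_match) in ky_children)      # True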
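
An equivalent way to write the same check, without building lists of ids, is to compare with the is operator. The sketch below is only an illustration: contains_exact is a hypothetical helper name (it is not part of BeautifulSoup), and Text is a made-up str subclass standing in for NavigableString so that the snippet runs on its own.

```python
def contains_exact(items, target):
    """Return True only if `target` itself (the same object) is among `items`."""
    return any(item is target for item in items)


class Text(str):
    """Hypothetical str subclass standing in for NavigableString."""


# two separate objects with identical text, like the two "Bourbon" nodes
ky_children = [Text("Bourbon")]
no_children = [Text("Bourbon")]
kentucky_match = ky_children[0]

print(kentucky_match in no_children)                # True  (compares text)
print(contains_exact(no_children, kentucky_match))  # False (compares identity)
print(contains_exact(ky_children, kentucky_match))  # True
```

With the real tree you would pass kentucky_parent.children as items. Since every element in your output already knows its parent, comparing kentucky_match.parent is kentucky_parent should answer the same question directly.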

Upvotes: 1
