BeautifulSoup incorrectly checking child membership for NavigableString elements?

Question

Part of my HTML page's tree looks something like this (the full HTML is in the code snippet below):

                    (common parent)
                       |    |
                       |    |
         Kentucky              NewOrleans
             |                      |
             |                      |
          Bourbon                Bourbon

Why is BeautifulSoup indicating that the "left" Bourbon is a child of both "Kentucky" (correct) and "NewOrleans" (incorrect)?

And vice-versa, that the right Bourbon is a child of "Kentucky" (incorrect).

It is not uncommon for different HTML elements on a page to contain identical text (e.g. in the header and the footer). But after I do a find_all() for some text pattern, I cannot trust header.children or footer.children to tell me correctly whether a matched text element is a child of one or the other.

(It's as if, in a company, the Engineering and Marketing departments both claimed that a particular employee belongs to them, just because her first name is "Sarah". There could be multiple Sarahs in the company; the first_name attribute is just one of many on that object and shouldn't determine its identity on its own.)

Can this be avoided, or is there another way to check whether a text element really is a child of a given element?

Note that the MRO of the NavigableString class starts with 'str':

<class 'str'>, <class 'bs4.element.PageElement'>, <class 'object'>

which suggests that the cause of the problem is that BeautifulSoup ends up using string comparison to decide whether two elements are equal (rather than checking whether they are the same object).
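To illustrate the suspicion, here is a minimal sketch that does not use BeautifulSoup at all. `TextNode` and its `parent_id` attribute are hypothetical stand-ins for a `str`-derived node, not bs4 names. If nothing overrides `__eq__`, a `str` subclass inherits value comparison, and list membership (`in`) relies on `==`:

```python
class TextNode(str):
    """Hypothetical stand-in for a str-derived tree node (not a bs4 class)."""

    def __new__(cls, value, parent_id):
        node = super().__new__(cls, value)
        node.parent_id = parent_id  # extra attribute, ignored by str.__eq__
        return node


left = TextNode("Bourbon", parent_id="Kentucky")
right = TextNode("Bourbon", parent_id="NewOrleans")

print(left is right)    # False: two distinct objects
print(left == right)    # True: str.__eq__ compares the text only
print(left in [right])  # True: list membership falls back to ==
```

If NavigableString behaves the same way, that would explain the results in the code below.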

Regardless of whether this is indeed the problem, is there an alternative, or a fix/patch?

Thanks!

Code:

import re
from bs4 import BeautifulSoup

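# Tag names here are an assumption; only the ids 'Kentucky' and
# 'NewOrleans' are confirmed by the output further down.
TEST_HTML = """
<html><head><title>A title</title></head>
<body>
   <div>
      <div id="Kentucky">Bourbon</div>
      <div id="NewOrleans">Bourbon</div>
   </div>
</body></html>
"""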

def test():
    # name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(TEST_HTML, 'html.parser')

    # search for "Bourbon"
    re_pattern = re.compile('bourbon', re.IGNORECASE)
    text_matches = soup.find_all(text=re_pattern)

    # print verbose debug output...
    for text_match in text_matches:
        print('id: {} - class: {} - text: {} - parent attrs: {}'.format(
            id(text_match),
            text_match.__class__.__name__,
            text_match.string,
            text_match.parent.attrs))
    # id: 140609176408136 - class: NavigableString - text: Bourbon - parent attrs: {'id': 'Kentucky'}
    # id: 140609176408376 - class: NavigableString - text: Bourbon - parent attrs: {'id': 'NewOrleans'}


    kentucky_match = text_matches[0]
    kentucky_parent = kentucky_match.parent

    new_orleans_match = text_matches[1]
    new_orleans_parent = new_orleans_match.parent

    # confirm -> all ok...
    print(kentucky_parent.attrs)      # {'id': 'Kentucky'}
    print(new_orleans_parent.attrs)   # {'id': 'NewOrleans'}

    # get a list of all the children for both kentucky and new orleans
    # (this tree traversal is all ok)
    ky_children = list(kentucky_parent.children)
    no_children = list(new_orleans_parent.children)

    # confirm -> all ok...
    print([id(child) for child in ky_children])   # [140609176408136]
    print([id(child) for child in no_children])   # [140609176408376]


    # now, here's the problem!!!
    print(kentucky_match in no_children)      # True  -> wrong!!!!!!!
    print(kentucky_match in ky_children)      # True

    print(new_orleans_match in no_children)   # True
    print(new_orleans_match in ky_children)   # True  -> wrong!!!!!!!
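
One possible workaround is to test membership by object identity (`is`) instead of `in`. The sketch below uses a plain `str` subclass as a stand-in for NavigableString so that it runs without bs4; `TextNode` and `contains_by_identity` are hypothetical names, not part of BeautifulSoup:

```python
def contains_by_identity(items, target):
    """Return True only if `target` itself (the same object) is in `items`."""
    return any(item is target for item in items)


class TextNode(str):
    """Hypothetical stand-in for a str-derived node such as NavigableString."""


ky_children = [TextNode("Bourbon")]
no_children = [TextNode("Bourbon")]
kentucky_match = ky_children[0]

print(kentucky_match in no_children)                      # True (value equality)
print(contains_by_identity(no_children, kentucky_match))  # False
print(contains_by_identity(ky_children, kentucky_match))  # True
```

With the real objects from the code above, the same helper could presumably be applied to `kentucky_parent.children`, or the check could be reduced to `kentucky_match.parent is kentucky_parent`.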
