GhostKU

Reputation: 2108

How to find tag by text with regex?

I need to find an HTML tag by part of its text. I found some solutions, but they don't work well for me.

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("""
<table>
    <tbody>
        <tr>
            <td style="width: 100px; height: 20px">
                <div style="font-size: 8.7pt">
                    Арт.: 
                    <span id="ContentPlaceHolder1_ContentPlaceHolder1_DataList2_Label12_0"> 1185A</span>
                    </div>
                <div style="font-size: 12pt; font-weight: bold;">
                    <span id="ContentPlaceHolder1_ContentPlaceHolder1_DataList2_LoginView3_0_Label12_0">I_CAN_GET_THIS other text</span>
                    I CAN NOT GET THIS?.
                </div>
            </td>
        </tr>
    </tbody>
</table>
""", 'lxml')
print(soup.find('span', text=re.compile('I_CAN_GET_THIS')))
print(soup.find('div', text=re.compile('I CAN NOT GET THIS')))

>>> <span id="ContentPlaceHolder1_ContentPlaceHolder1_DataList2_LoginView3_0_Label12_0">I_CAN_GET_THIS other text</span>
>>> None

I can't understand why it doesn't work in the second case. What should I do to make it work? Thanks.

Upvotes: 2

Views: 691

Answers (1)

alecxe

Reputation: 474191

The text argument (since renamed to string, though text is still supported) is matched against an element's .string attribute, which is None when the element has more than one child:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None

This is exactly the case with your target div element - it has a span child and a text node.
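To see this behaviour in isolation, here is a minimal sketch on a cut-down version of the markup. The shortened HTML and the choice of "html.parser" (used only to avoid the lxml dependency) are assumptions for illustration, not part of the original question:

```python
from bs4 import BeautifulSoup

# Simplified markup modelled on the question (an assumption for illustration).
html = '<div><span>I_CAN_GET_THIS other text</span> I CAN NOT GET THIS?.</div>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span")
div = soup.find("div")

# The span has a single text child, so .string is that text.
print(repr(span.string))

# The div has two children (the span and a text node), so .string is None
# and a text/string filter on the div has nothing to match against.
print(len(div.contents))
print(repr(div.string))
```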

Instead, you can locate the text node and then get its parent:

soup.find(text=re.compile('I CAN NOT GET THIS')).parent

Or pass a function as the filter and check .get_text(), which concatenates the text of all descendants:

soup.find(lambda tag: tag.name == 'div' and 'I CAN NOT GET THIS' in tag.get_text())
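Both approaches can be tried side by side. The following is a sketch rather than a definitive recipe: the markup is a shortened version of the question's, "html.parser" replaces lxml to avoid the extra dependency, and the variable names are made up for illustration. It uses the newer string= keyword, which assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the question's markup (an assumption for illustration).
html = """
<div style="font-size: 8.7pt">Арт.: <span> 1185A</span></div>
<div style="font-size: 12pt; font-weight: bold;">
    <span>I_CAN_GET_THIS other text</span>
    I CAN NOT GET THIS?.
</div>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("I CAN NOT GET THIS")

# Approach 1: match the text node itself, then step up to its parent tag.
text_node = soup.find(string=pattern)
div_via_parent = text_node.parent if text_node is not None else None

# Approach 2: filter tags by name and by their combined descendant text.
div_via_filter = soup.find(
    lambda tag: tag.name == "div" and pattern.search(tag.get_text()) is not None
)

# Both should point at the same <div> element.
print(div_via_parent is div_via_filter)
print(div_via_parent["style"])
```

One difference worth keeping in mind: the first approach returns the immediate parent of the matching text node, while the second returns the first div in document order whose combined text matches, which could be an enclosing div if the markup nests them.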

Upvotes: 3
