How to find tag by text with regex?

Question

I need to find an HTML tag by part of its text. I found some solutions, but they don't work well for me.

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("""

    
        
            
                
                    Арт.: 
                     1185A
                    
                
                    I_CAN_GET_THIS other text
                    I CAN NOT GET THIS?.
                
            
        
    

""", 'lxml')
print(soup.find('span', text=re.compile('I_CAN_GET_THIS')))
print(soup.find('div', text=re.compile('I CAN NOT GET THIS')))

>>> I_CAN_GET_THIS other text
>>> None

So I can't understand why it doesn't work in the second case. What should I do to make it work? Thanks

alecxe · Accepted Answer

The text argument (now renamed to string, though text is still supported) is matched against an element's .string attribute, which is None when the element has more than one child:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None

This is exactly the case with your target div element - it has a span child and a text node.
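To see this behaviour in isolation, here is a minimal sketch. The markup is an assumption: a simplified stand-in for the relevant part of your document, parsed with the built-in html.parser rather than lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div holding a span child plus a trailing text node.
html = "<div><span>I_CAN_GET_THIS other text</span> I CAN NOT GET THIS?.</div>"
soup = BeautifulSoup(html, "html.parser")

span_string = soup.span.string        # one child, so .string is that text
div_string = soup.div.string          # two children, so .string is None
div_children = len(soup.div.contents)

print(repr(span_string))
print(repr(div_string))
print(div_children)
```

Because the div's .string is None, there is nothing for the regex to be tested against, so find() skips that div.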

Instead, you can locate the text node and then get its parent:

soup.find(text=re.compile('I CAN NOT GET THIS')).parent

Or pass a function to find() that checks .get_text(), which concatenates the text of all of the tag's descendants:

soup.find(lambda tag: tag.name == 'div' and 'I CAN NOT GET THIS' in tag.get_text())
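The two approaches can return different elements when divs are nested, as they appear to be in your document. The sketch below illustrates that. The markup and the id attributes are assumptions added only to tell the divs apart, and the string keyword needs Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical nested markup; the ids exist only to label the divs.
html = """
<div id="outer">
  <div id="inner">
    <span>I_CAN_GET_THIS other text</span>
    I CAN NOT GET THIS?.
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("I CAN NOT GET THIS")

# Text-node approach: find the matching text node, then its closest div.
by_text_node = soup.find(string=pattern).find_parent("div")

# Function approach: first div whose combined descendant text matches.
by_function = soup.find(
    lambda tag: tag.name == "div"
    and pattern.search(tag.get_text()) is not None
)

print(by_text_node["id"])
print(by_function["id"])
```

Since get_text() includes the text of every descendant, the function approach matches the outermost div first. If you need the div that directly contains the text, the text-node approach is probably the safer choice.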
