How can I get a tag element by text content in BeautifulSoup4

Question

I have to scrape data from a thousand sites saved as local HTML files. The complication is that these sites have a 90s-style structure: almost the same nested tables everywhere, no ids, no CSS classes, only nested tables. How can I select a specific table based on the text in one of its tr tags?

XPath is not a solution because the sites mostly share the same structure but do not always have the same table order. So I'm looking for a way to extract the table data from all of them by searching for a table that contains certain text and, from that match, obtaining the parent tag.

Any idea?

The code on every page is huge; here is an example of the structure. The data is not always in the same table position.

Update: thanks to alecxe I wrote this code:

# coding: utf-8
from bs4 import BeautifulSoup

html_content = """

 
     
         

             
                 text i'm searching
             
             
                 Some other text
             
         
     
     
         
             
                 
                     Other text
                 
                 
                     Some other text
                 
             
         
     
 

 
     
         Different table
 
 

 """
soup = BeautifulSoup(html_content, "lxml")
table = soup.find(lambda tag: tag.name == "table" and "searching" in tag.text)
print(table)

Printing table gives the same output as printing soup:


    
        
.
.
.
But with this code:
soup = BeautifulSoup(html_content, "lxml")
table = soup.find(lambda tag: tag.name == "td" and "searching" in tag.text).parent.parent
print(table)
I get the output that I want:

            
                text i'm searching
            
    








    
        text im searching
    
    
        Some other text



But what if the table is not always exactly two parent levels up? I mean, once I have this td tag, how can I get the table it belongs to?

alecxe · Accepted Answer

You should use find() with a searching function and check the .text of a table to contain the desired text:

soup.find(lambda tag: tag.name == "table" and "part of text" in tag.text)

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... 
...     
...         
...             
...         
...         
...             
...         
...     This text has a part of text
...     Some other text
... 
...     
...         
...             
...         
...     Different table
... 
... 
... """
>>> 
>>> soup = BeautifulSoup(data, 'lxml')
>>> 
>>> table = soup.find(lambda tag: tag.name == "table" and "part of text" in tag.text)
>>> print(table)

    
        This text has a part of text
    
    
        Some other text
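
For the follow-up in the question (getting from a matched td to its table without hard-coding .parent.parent), one possible approach is find_parent("table"), which walks up any number of levels to the nearest enclosing table. The sketch below is only an illustration: the markup is a hypothetical stand-in for the real pages, find_table_by_cell_text is a made-up helper name, and "html.parser" is used instead of "lxml" just to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the nested-table pages in the question.
html_content = """
<table>
  <tr>
    <td>
      <table>
        <tr><td>text i'm searching</td></tr>
        <tr><td>Some other text</td></tr>
      </table>
    </td>
  </tr>
</table>
<table>
  <tr><td>Different table</td></tr>
</table>
"""


def find_table_by_cell_text(soup, needle):
    """Return the nearest <table> around the first leaf <td> containing needle."""
    # Skip any <td> that wraps a nested table: its text also contains the
    # needle, but it belongs to the outer table, not the one we want.
    cell = soup.find(
        lambda tag: tag.name == "td"
        and needle in tag.get_text()
        and tag.find("table") is None
    )
    # find_parent climbs as many levels as needed until it reaches a <table>.
    return cell.find_parent("table") if cell is not None else None


# Assumption: the built-in parser is enough here; "lxml" should work as well.
soup = BeautifulSoup(html_content, "html.parser")
table = find_table_by_cell_text(soup, "searching")
print(table)
```

Because the search starts from the cell and climbs upward, the result should be the innermost table around the text, regardless of how deeply that table is nested.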
