theelmix

Reputation: 61

How can I get a tag element by its text content in BeautifulSoup4?

I have to scrape data from about a thousand sites saved as local HTML files. The complication is that these sites have a 90s-style structure: almost the same nested tables everywhere, with no ids and no CSS classes. How can I select a specific table based on the text in one of its tr tags?

XPath is not a solution because, although the sites mostly share the same structure, the tables are not always in the same order. So I'm looking for a way to extract the table data from all of them by searching for a table that contains certain text and then obtaining the parent tag.

Any idea?

The code on every page is huge, so here is an example of the structure. The data is not always in the same table position.

Update: thanks to alecxe, I wrote this code:

# coding: utf-8
from bs4 import BeautifulSoup

html_content = """
<body>
 <table id="gotthistable">
     <tr>
         <table id="needthistable">
             <tr>
                 <td>text i'm searching</td>
             </tr>
             <tr>
                 <td>Some other text</td>
             </tr>
         </table>
     </tr>
     <tr>
         <td>
             <table>
                 <tr>
                     <td>Other text</td>
                 </tr>
                 <tr>
                     <td>Some other text</td>
                 </tr>
             </table>
         </td>
     </tr>
 </table>

 <table>
     <tr>
         <td>Different table</td>
     </tr>
 </table>
</body>
 """
soup = BeautifulSoup(html_content, "lxml")
table = soup.find(lambda tag: tag.name == "table" and "searching" in tag.text)
print(table)

The output of print(table) is the same as printing the whole soup:

<table>
    <tr>
        <table id="needthistable">
            <tr>
                <td>text i'm searching</td>
            </tr>
    </tr>
.
.
.

But with this code:

soup = BeautifulSoup(html_content, "lxml")
table = soup.find(lambda tag: tag.name == "td" and "searching" in tag.text).parent.parent
print(table)

I got the output that I want:

<table id="needthistable">
    <tr>
        <td>text i'm searching</td>
    </tr>
    <tr>
        <td>Some other text</td>
    </tr>
</table>

But what if the table is not always exactly two parent elements above? I mean, once I have this td tag, how can I get the table it belongs to?

Upvotes: 0

Views: 642

Answers (2)

alecxe

Reputation: 473853

Use find() with a search function that checks whether a table's .text contains the desired text:

soup.find(lambda tag: tag.name == "table" and "part of text" in tag.text)

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <body>
...     <table>
...         <tr>
...             <td>This text has a part of text</td>
...         </tr>
...         <tr>
...             <td>Some other text</td>
...         </tr>
...     </table>
... 
...     <table>
...         <tr>
...             <td>Different table</td>
...         </tr>
...     </table>
... </body>
... 
... """
>>> 
>>> soup = BeautifulSoup(data, 'lxml')
>>> 
>>> table = soup.find(lambda tag: tag.name == "table" and "part of text" in tag.text)
>>> print(table)
<table>
    <tr>
        <td>This text has a part of text</td>
    </tr>
    <tr>
        <td>Some other text</td>
    </tr>
</table>
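
When tables are nested, as in the question's update, a table's .text also includes the text of every table inside it, so the search function above returns the outermost match. A minimal sketch of one way around this, assuming the text sits in a leaf td: find that cell first, then climb to the nearest enclosing table with find_parent(). The sample markup and its ids are made up for illustration, and "html.parser" is used here only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the wanted table is nested inside another table.
html = """
<table id="outer">
  <tr><td>
    <table id="inner">
      <tr><td>text i'm searching</td></tr>
      <tr><td>Some other text</td></tr>
    </table>
  </td></tr>
  <tr><td>Other text</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Matching on the table itself returns the outermost table, because
# its text includes the text of all nested tables.
outer_match = soup.find(
    lambda tag: tag.name == "table" and "searching" in tag.get_text()
)

# Match the innermost cell instead (a td with no td nested inside it) ...
cell = soup.find(
    lambda tag: tag.name == "td"
    and tag.find("td") is None
    and "searching" in tag.get_text()
)

# ... then walk up to the closest enclosing table, however deep the cell is.
table = cell.find_parent("table")

print(outer_match["id"])
print(table["id"])
```

Because find_parent("table") stops at the first table ancestor, it does not matter how many tr or tbody levels lie between the cell and its table.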

Upvotes: 0

宏杰李

Reputation: 12158

Use BeautifulSoup's regular expression filter:

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method.

Example:

soup.find_all(name='tr', text=re.compile('this is part or full text of tr'))
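
As far as I can tell, combining a tag name with the text filter compares the pattern against the tag's .string, which is None for a tr that contains more than one child (for example whitespace plus a td), so the call above may return nothing on real pages. A hedged sketch of an alternative: search for the text node itself, then use find_parent() to reach the row or table. The sample markup and ids are invented for illustration; string= is the newer spelling of the text= argument (Beautiful Soup 4.4.0 and later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup with two sibling tables.
html = """
<table id="first"><tr><td>Different table</td></tr></table>
<table id="wanted">
  <tr><td>this is part or full text of tr</td></tr>
  <tr><td>Some other text</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"part or full text")

# With no tag name, the filter is applied to text nodes directly,
# so the result is a NavigableString rather than a Tag.
text_node = soup.find(string=pattern)

# Climb from the text node to the enclosing row and table.
row = text_node.find_parent("tr")
table = text_node.find_parent("table")

print(row.get_text(strip=True))
print(table["id"])
```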

Upvotes: 1
