YJZ

Reputation: 4204

beautifulsoup: find_all on bs4.element.ResultSet object or list?

I call find_all on a BeautifulSoup object and get back a result that is either a bs4.element.ResultSet object or, after slicing, a plain list.

I want to further do find_all in there, but it's not allowed on a bs4.element.ResultSet object. I can loop through each element of the bs4.element.ResultSet object to do find_all. But can I avoid looping and just convert it back to a beautifulsoup object?

Here is my code:

from bs4 import BeautifulSoup

html_1 = """
<table>
    <thead>
        <tr class="myClass">
            <th>A</th>
            <th>B</th>
            <th>C</th>
            <th>D</th>
        </tr>
    </thead>
</table>
"""
soup = BeautifulSoup(html_1, 'html.parser')

type(soup) #bs4.BeautifulSoup

# do find_all on beautifulsoup object
th_all = soup.find_all('th')

# the result is of type bs4.element.ResultSet or similarly list
type(th_all) #bs4.element.ResultSet
type(th_all[0:1]) #list

# now I want to further do find_all
th_all.find_all(text='A') # does not work: raises AttributeError

# can I avoid this need of loop?
for th in th_all:
    th.find_all(text='A') #works

Upvotes: 29

Views: 83589

Answers (2)

cottontail

Reputation: 23171

I know this is many years late, but I went down the rabbit hole the other day and found that ResultSet is a subclass of Python's list (source code); it's really just a list with an additional .source attribute, which is often None.
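As a quick check of the list relationship, here is a minimal sketch. The shortened markup and the variable names are made up for illustration; only find_all() and the ResultSet class come from bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet

# Shortened, made-up markup for illustration only.
html = "<table><tr><th>A</th><th>B</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")
th_all = soup.find_all("th")

is_result_set = isinstance(th_all, ResultSet)  # find_all() returns a ResultSet
is_list = isinstance(th_all, list)             # ...which is also a list
sliced_type = type(th_all[0:1])                # slicing gives back a plain list

# ResultSet itself does not offer find_all(); only the Tags inside it do.
has_find_all = hasattr(th_all, "find_all")
first_has_find_all = hasattr(th_all[0], "find_all")

print(is_result_set, is_list, sliced_type, has_find_all, first_has_find_all)
```

So anything that works on a list (indexing, slicing, comprehensions) works on the result, while the search methods live on the individual elements.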

Now, back to the OP's main question: the filtering can be done during the find_all() call itself by passing the tag name and the string it should match. This returns a ResultSet of tags (th_all below). To extract the actual text inside these tags, we still have to loop through it.

from bs4 import BeautifulSoup

html_1 = """
<table>
    <thead>
        <tr class="myClass">
            <th>A</th>
            <th>B</th>
            <th>C</th>
            <th>D</th>
        </tr>
    </thead>
</table>
"""
soup = BeautifulSoup(html_1, 'html.parser')

th_all = soup.find_all('th', string='A')  # [<th>A</th>]

texts = [th.string for th in th_all]      # ['A']

To answer the second part of the question:

How do we convert a ResultSet into a BeautifulSoup object?

There is no direct cast, but we can serialize the tags back to a string and parse that string again. Then we can call find_all() on the new object.

th_all = soup.find_all('th')
soup2 = BeautifulSoup('\n'.join(map(str, th_all)), 'html.parser')
soup2.find_all(string='A')   # ['A']

However, since the filter can already be applied in the original find_all() call (or by looping over the ResultSet), re-parsing is probably not desirable in this context.
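One reason to be cautious about re-parsing is that the new object holds fresh copies of the tags, detached from the original tree. A minimal sketch, assuming the same kind of markup as above (the shortened HTML and the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Shortened, made-up markup for illustration only.
html = "<table><thead><tr><th>A</th><th>B</th></tr></thead></table>"
soup = BeautifulSoup(html, "html.parser")
th_all = soup.find_all("th")

# Serialize the tags and parse the resulting string again.
soup2 = BeautifulSoup("\n".join(map(str, th_all)), "html.parser")

original_parent = th_all[0].parent.name   # the <tr> in the original tree
reparsed_parent = soup2.th.parent.name    # the new document root
same_object = soup2.th is th_all[0]       # the re-parsed tag is a copy

print(original_parent, reparsed_parent, same_object)
```

Navigation such as .parent on the re-parsed tags no longer reaches the original table, so changes made through soup2 do not affect soup.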

Upvotes: 1

alecxe

Reputation: 473873

The ResultSet class is a subclass of list, not of the Tag class, which is where the find* methods are defined. Looping through the results of find_all() is the most common approach:

th_all = soup.find_all('th')
result = []
for th in th_all:
    result.extend(th.find_all(string='A'))  # string= is the newer name for text=
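If the loop comes up often, it can be folded into a nested list comprehension or a small helper. The sketch below uses a hypothetical helper name, find_all_in, which is not part of bs4; it only forwards its arguments to each element's own find_all().

```python
from bs4 import BeautifulSoup


def find_all_in(elements, *args, **kwargs):
    """Hypothetical helper: call find_all() on each Tag and flatten the matches."""
    return [match for el in elements for match in el.find_all(*args, **kwargs)]


# Shortened, made-up markup for illustration only.
html = "<table><tr><th>A</th><th>B</th><th>C</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")

th_all = soup.find_all("th")
matches = find_all_in(th_all, string="A")

print(matches)
```

This is still a loop under the hood; it just keeps the call site to one line.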

CSS selectors can often solve this in one go, although not everything you can do with find_all() is possible with the select() method. For instance, standard CSS selectors have no "text" search (newer soupsieve releases add a non-standard :-soup-contains() pseudo-class for this). But if, for example, you had to find all b elements inside th elements, you could do:

soup.select("th b")
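A minimal sketch of the nested-element case with a descendant selector, assuming made-up markup in which some th cells contain b elements (the markup and variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Made-up markup: two of the three header cells wrap their text in <b>.
html = "<table><tr><th><b>A</b></th><th>B</th><th><b>C</b></th></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# "th b" matches every <b> that is a descendant of a <th>.
bold_in_th = soup.select("th b")
bold_texts = [b.get_text() for b in bold_in_th]

print(bold_texts)
```

The selector does the nested search in a single call, so no explicit loop over the th elements is needed.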

Upvotes: 28
