Acuity

Reputation: 47

How do I have nested find_all statements in BeautifulSoup (Python)?

I started off by pulling the page with Selenium and I believe I passed the part of the page I needed to BeautifulSoup correctly using this code:

soup = BeautifulSoup(driver.find_element("xpath", '//*[@id="version_table"]/tbody').get_attribute('outerHTML'))

Now I can parse it with BeautifulSoup:

query = soup.find_all("tr", class_=lambda x: x != "hidden*")
print (query)

My problem is that I need to dig deeper than this one query. For example, I would like to nest the following query inside the first one, so that a row has to satisfy the first condition and then this one as well:

query2 = soup.find_all("tr", id = "version_new_*")
print (query2)

Logically speaking, this is what I'm trying to do (but I get SyntaxError: invalid syntax):

query = soup.find_all(("tr", class_=lambda x: x != "hidden*") and ("tr", id = "version_new_*"))
print (query)

How do I accomplish this?

Upvotes: 1

Views: 190

Answers (3)

HedgeHog

Reputation: 25048

As mentioned, without an example of the HTML it is hard to give a precise answer. However, you could use a CSS selector:

soup.select('tr[id^="version_new_"]:not(.hidden)')

Example

from bs4 import BeautifulSoup

html = '''
<tr id="version_new_1" class="hidden"></tr>
<tr id="version_new_2"></tr>
<tr id="version_new_3" class="hidden"></tr>
<tr id="version_new_4"></tr>
'''

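# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('tr[id^="version_new_"]:not(.hidden)'))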

Output

The result is a ResultSet that you can iterate over to scrape what you need.

[<tr id="version_new_2"></tr>, <tr id="version_new_4"></tr>]
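
As a minimal sketch of what iterating that ResultSet could look like: the markup below is hypothetical (the real version_table was never posted), and the single td per row is an assumption made only so there is something to extract.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real version_table body.
html = """
<table id="version_table"><tbody>
<tr id="version_new_1" class="hidden"><td>1.0</td></tr>
<tr id="version_new_2"><td>2.0</td></tr>
<tr id="version_old_3"><td>0.9</td></tr>
<tr id="version_new_4" class="row"><td>4.0</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Both conditions live in one selector: id prefix AND not the "hidden" class.
rows = soup.select('tr[id^="version_new_"]:not(.hidden)')

# Map each matching row's id to the text of its first cell.
versions = {row["id"]: row.td.get_text(strip=True) for row in rows}
print(versions)
```

The row with class="row" is kept because :not(.hidden) only rejects rows that actually carry the hidden class.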

Upvotes: 2

NFeruch - FreePalestine

Reputation: 1145

You can pass a function to find_all (here combined with a regex) to test several conditions on every element:

import re

query = soup.find_all(
    lambda tag:
        tag.name == 'tr'
        # id must start with "version_new_" (a missing id is treated as '')
        and re.search(r'^version_new_', tag.get('id', '')) is not None
        # class is a list of names in BeautifulSoup (a missing class is treated as [])
        and 'hidden' not in tag.get('class', [])
)
print(query)

For every element in the HTML, we are checking...

  1. If the tag is a table row (tr)
  2. If the tag has an id that starts with version_new_
  3. If the tag's class list does not contain hidden (rows with no class at all also pass)
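
If a literally "nested" query is preferred, the same checks can be split into two stages. This is only a sketch under assumptions: the markup is made up for illustration, and the second stage is a plain list comprehension over the first result rather than a second find_all call.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the real table was not included in the question.
html = """
<table><tbody>
<tr id="version_new_1" class="hidden"></tr>
<tr id="version_new_2"></tr>
<tr id="version_old_3"></tr>
<tr id="version_new_4" class="row"></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Stage 1: rows whose id starts with the prefix (a compiled regex is searched against the id).
candidates = soup.find_all("tr", id=re.compile(r"^version_new_"))

# Stage 2: of those, keep only rows whose class list does not contain "hidden".
visible = [row for row in candidates if "hidden" not in row.get("class", [])]

visible_ids = [row["id"] for row in visible]
print(visible_ids)
```

Splitting the stages makes it easy to inspect candidates on its own when a row unexpectedly goes missing.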

Upvotes: 1

OysterShucker

Reputation: 5531

Regarding: query = soup.find_all(...) and print (query)

find_all returns a ResultSet, which is a subclass of list, so you can loop over it and handle each match individually instead of printing the whole result at once.

for query in soup.find_all(...):
    print(query)
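
A small sketch of what that looks like in practice, using made-up markup (the ids a and b and the cell text are placeholders, not taken from the question):

```python
from bs4 import BeautifulSoup

# Placeholder markup for illustration only.
html = '<table><tr id="a"><td>x</td></tr><tr id="b"><td>y</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")

# A ResultSet supports the usual list operations.
print(len(rows))
print(rows[0]["id"])

# Each item is a Tag, so attributes and text are available per row.
pairs = [(row["id"], row.get_text(strip=True)) for row in rows]
for row_id, text in pairs:
    print(row_id, text)
```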

Upvotes: 1
