Reputation: 4008
I have the following structure:
<div class="alpha">
<div class="alpha">
<div class="alpha">
<div class="alpha betha">
<div class="alpha gama">
<div class="alpha">
I need to
I know I can get all elements that have a given class:
container.findAll('div', {'class': 'alpha'})
But how do I separate/ignore the others for 2 and 3?
Upvotes: 1
Views: 1933
Reputation: 8215
Why not create a helper function? bs4 allows you to specify a function as a filter while searching the tree with find_all().
From the docs:
If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.
The issue is that we can't pass any other arguments to that function (a list of valid classes, in this case). We can overcome this by using a wrapper function that creates the filters dynamically.
def create_filter(tag_name, class_list):
    def class_filter(tag):
        # Match only when the tag name is right and the class set is exactly equal
        return (
            tag.name == tag_name and
            set(tag.get('class', [])) == set(class_list)
        )
    return class_filter
Let's see how this works on @AndrejKesely's sample HTML.
Only alpha
print(soup.find_all(create_filter('div', ['alpha'])))
Output
[<div class="alpha">1</div>, <div class="alpha">2</div>, <div class="alpha">3</div>, <div class="alpha">8</div>]
Only alpha and betha
print(soup.find_all(create_filter('div', ['alpha', 'betha'])))
Output
[<div class="alpha betha">4</div>, <div class="betha alpha">5</div>]
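For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The shortened sample HTML and the extra span are assumptions added for illustration; the filter itself is the one above, with the class set built once outside the inner function.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample markup (not the asker's real page)
html = '''<div class="alpha">1</div>
<div class="alpha betha">4</div>
<div class="betha alpha">5</div>
<div class="alpha betha gama">6</div>
<span class="alpha">9</span>'''

def create_filter(tag_name, class_list):
    wanted = frozenset(class_list)  # order of classes does not matter

    def class_filter(tag):
        # Exact match: same tag name and exactly the wanted set of classes
        return tag.name == tag_name and frozenset(tag.get('class', [])) == wanted

    return class_filter

soup = BeautifulSoup(html, 'html.parser')
only_alpha = [t.get_text() for t in soup.find_all(create_filter('div', ['alpha']))]
alpha_betha = [t.get_text() for t in soup.find_all(create_filter('div', ['alpha', 'betha']))]

print(only_alpha)   # ['1']
print(alpha_betha)  # ['4', '5']
```

Because the comparison is on sets, "alpha betha" and "betha alpha" are treated the same, and the span is skipped because its tag name differs.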
Upvotes: 2
Reputation: 195438
You can use CSS selectors with the .select() method:
txt = '''<div class="alpha">1</div>
<div class="alpha">2</div>
<div class="alpha">3</div>
<div class="alpha betha">4</div>
<div class="betha alpha">5</div>
<div class="alpha betha gama">6</div>
<div class="alpha gama">7</div>
<div class="alpha">8</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
only_alpha = soup.select('[class="alpha"]')
only_alpha_betha = soup.select('.alpha.betha:not(.gama)')
print('Only alpha:', only_alpha)
print('Only alpha and betha:', only_alpha_betha)
Prints:
Only alpha: [<div class="alpha">1</div>, <div class="alpha">2</div>, <div class="alpha">3</div>, <div class="alpha">8</div>]
Only alpha and betha: [<div class="alpha betha">4</div>, <div class="betha alpha">5</div>]
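Note that the :not(.gama) part only works when you know in advance which extra classes to exclude. If you don't, one possible sketch is to select with CSS first and then keep only the elements whose class set matches exactly. The helper name select_exact is hypothetical, not part of bs4.

```python
from bs4 import BeautifulSoup

txt = '''<div class="alpha">1</div>
<div class="alpha">2</div>
<div class="alpha">3</div>
<div class="alpha betha">4</div>
<div class="betha alpha">5</div>
<div class="alpha betha gama">6</div>
<div class="alpha gama">7</div>
<div class="alpha">8</div>'''

def select_exact(soup, tag_name, classes):
    # Hypothetical helper: builds e.g. 'div.alpha.betha', then drops
    # any element that carries classes beyond the requested ones.
    selector = tag_name + ''.join('.' + c for c in classes)
    wanted = set(classes)
    return [el for el in soup.select(selector)
            if set(el.get('class', [])) == wanted]

soup = BeautifulSoup(txt, 'html.parser')
only_alpha = [el.get_text() for el in select_exact(soup, 'div', ['alpha'])]
only_alpha_betha = [el.get_text() for el in select_exact(soup, 'div', ['alpha', 'betha'])]

print(only_alpha)        # ['1', '2', '3', '8']
print(only_alpha_betha)  # ['4', '5']
```

This assumes plain class names that need no escaping in a CSS selector.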
Upvotes: 4