user3541631

Reputation: 4008

Beautiful Soup - select all elements that have a class, but separate or ignore elements that also have other classes

I have the following structure:

<div class="alpha">
<div class="alpha">
<div class="alpha">
<div class="alpha betha">
<div class="alpha gama">
<div class="alpha">

I need to

  1. Get all the elements that have only 'alpha' as a class, in one list.
  2. Get all the elements that have 'alpha' and 'betha', in another list.
  3. Ignore other combinations, such as "alpha gama".

I know I can get all elements that have the class 'alpha' with:

container.findAll('div', {'class': 'alpha'})

But how do I separate or ignore elements for points 2 and 3?

Upvotes: 1

Views: 1933

Answers (2)

Bitto

Reputation: 8215

Why not create a helper function?

bs4 allows you to pass a function as a filter when searching the tree with find_all().

From the docs:

If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.

The issue is that we can't pass any other arguments to that function (a list of valid classes, in this case). We can overcome this by using a wrapper function that creates the filters dynamically.

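def create_filter(tag_name, class_list):
    # Build a filter that matches tags named tag_name whose classes
    # are exactly class_list (order does not matter, no extra classes).
    def class_filter(tag):
        return (
            tag.name == tag_name and
            set(tag.get('class', [])) == set(class_list)
        )
    return class_filter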

Let's see how this works on @AndrejKesely's sample HTML.

Only alpha

print(soup.find_all(create_filter('div', ['alpha'])))

Output

[<div class="alpha">1</div>, <div class="alpha">2</div>, <div class="alpha">3</div>, <div class="alpha">8</div>]

Only alpha and betha

print(soup.find_all(create_filter('div', ['alpha', 'betha'])))

Output

[<div class="alpha betha">4</div>, <div class="betha alpha">5</div>]
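For anyone who wants to run the approach above on its own, here is a minimal self-contained sketch. The shortened sample HTML and the extra span element are assumptions added for illustration; the span is there only to show that the tag name check also matters.

```python
from bs4 import BeautifulSoup

# Shortened sample (an assumption for this sketch); the <span> is added
# to show that the tag name is checked as well as the classes.
html = '''<div class="alpha">1</div>
<div class="alpha betha">4</div>
<div class="betha alpha">5</div>
<div class="alpha betha gama">6</div>
<div class="alpha gama">7</div>
<span class="alpha">9</span>'''


def create_filter(tag_name, class_list):
    wanted = set(class_list)

    def class_filter(tag):
        # Exact match on the set of classes: order is irrelevant,
        # and any extra class makes the tag fail the test.
        return tag.name == tag_name and set(tag.get('class', [])) == wanted

    return class_filter


soup = BeautifulSoup(html, 'html.parser')

only_alpha = [d.get_text() for d in soup.find_all(create_filter('div', ['alpha']))]
only_alpha_betha = [d.get_text() for d in soup.find_all(create_filter('div', ['alpha', 'betha']))]

print(only_alpha)        # ['1']
print(only_alpha_betha)  # ['4', '5']
```

Because the comparison is done on sets, "alpha betha" and "betha alpha" are treated the same, while "alpha betha gama" and "alpha gama" are left out of both lists.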

Upvotes: 2

Andrej Kesely

Reputation: 195438

You can use CSS selectors with the .select() method:

txt = '''<div class="alpha">1</div>
<div class="alpha">2</div>
<div class="alpha">3</div>
<div class="alpha betha">4</div>
<div class="betha alpha">5</div>
<div class="alpha betha gama">6</div>
<div class="alpha gama">7</div>
<div class="alpha">8</div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

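# Attribute selector: the class attribute must be exactly "alpha"
only_alpha = soup.select('[class="alpha"]')
# Class selectors: has both alpha and betha, but not gama
only_alpha_betha = soup.select('.alpha.betha:not(.gama)')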

print('Only alpha:', only_alpha)
print('Only alpha and betha:', only_alpha_betha)

Prints:

Only alpha: [<div class="alpha">1</div>, <div class="alpha">2</div>, <div class="alpha">3</div>, <div class="alpha">8</div>]
Only alpha and betha: [<div class="alpha betha">4</div>, <div class="betha alpha">5</div>]
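One thing to keep in mind: the :not(.gama) selector excludes only that one class, so an element with some other extra class would still match. If the unwanted classes are not known in advance, a possible variation is to select every div.alpha and group the results by their exact set of classes. This is a sketch, not part of the original answer; the class name delta and the groups dictionary are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample with one extra, hypothetical class ("delta") that the
# :not(.gama) selector does not know about.
html = '''<div class="alpha">1</div>
<div class="alpha betha">4</div>
<div class="betha alpha">5</div>
<div class="alpha betha gama">6</div>
<div class="alpha betha delta">9</div>
<div class="alpha">8</div>'''

soup = BeautifulSoup(html, 'html.parser')

# The CSS selector still lets the "delta" element through.
css_matches = [d.get_text() for d in soup.select('.alpha.betha:not(.gama)')]
print(css_matches)  # ['4', '5', '9']

# Group by the exact set of classes; any other combination is ignored.
groups = {
    frozenset({'alpha'}): [],
    frozenset({'alpha', 'betha'}): [],
}
for div in soup.select('div.alpha'):
    key = frozenset(div.get('class', []))
    if key in groups:
        groups[key].append(div.get_text())

print(groups[frozenset({'alpha'})])           # ['1', '8']
print(groups[frozenset({'alpha', 'betha'})])  # ['4', '5']
```

If gama really is the only other class that can appear, the shorter selectors above are enough.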

Upvotes: 4
