Miguel Rozsas

Reputation: 427

BeautifulSoup - How to find a specific class name alone

How can I find the li tags that have one specific class name and no other classes? For example:

...
<li> not wanted </li>
<li class="a"> not this one </li>
<li class="a z"> neither this one </li>
<li class="b z"> neither this one </li>
<li class="c z"> neither this one </li>
...
<li class="z"> I WANT THIS ONLY ONE</li>
...

The code:

bs4.find_all('li', class_='z') returns every entry whose classes include "z", even when other class names are present.

How can I find only the entry whose class is exactly "z"?

Upvotes: 9

Views: 13224

Answers (3)

Raja Muhammad Saad

Reputation: 31

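You can search for the class and then keep only the tags whose class list is exactly ['z'] (the plain find_all call on its own also returns tags such as class="a z"):

data = [li for li in soup.find_all('li', {'class': 'z'}) if li.get('class') == ['z']]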
print(data)

If you only want the text:

for a in data:
    print(a.text)

Upvotes: 1

Keyur Potdar

Reputation: 7238

You can use CSS selectors to match the exact class name.

from bs4 import BeautifulSoup

html = '''<li> not wanted </li>
<li class="a"> not this one </li>
<li class="a z"> neither this one </li>
<li class="b z"> neither this one </li>
<li class="c z"> neither this one </li>
<li class="z"> I WANT THIS ONLY ONE</li>'''

soup = BeautifulSoup(html, 'lxml')

tags = soup.select('li[class="z"]')
print(tags)

The same result can be achieved with a lambda that compares the tag's whole class list:

tags = soup.find_all(lambda tag: tag.name == 'li' and tag.get('class') == ['z'])

Output:

[<li class="z"> I WANT THIS ONLY ONE</li>]

Have a look at the "Multi-valued attributes" section of the Beautiful Soup documentation. It explains why class_='z' matches every tag that has z among its class names:

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']
# ['body']

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']
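
To tie the quoted passage back to the question, here is a minimal sketch of how the class list behaves with find_all. The two-item HTML snippet and the variable names (class_lists, loose, exact) are made up for illustration, and 'html.parser' is used only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the asker's real page.
html = """
<li class="a z">a and z</li>
<li class="z">only z</li>
"""
soup = BeautifulSoup(html, "html.parser")

# class is multi-valued, so each tag exposes it as a list of names.
class_lists = [li["class"] for li in soup.find_all("li")]

# class_="z" matches any tag whose class list contains "z".
loose = soup.find_all("li", class_="z")

# Comparing the whole list keeps only the tag whose sole class is "z".
exact = [li for li in loose if li["class"] == ["z"]]

print(class_lists)
print(len(loose), len(exact))
print(exact[0].get_text())
```

Because the comparison is against the full list, a tag with class="z a" or class="a z" is excluded, which is the behaviour the question asks for.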

Upvotes: 12

birdspider

Reputation: 3074

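Possibly with a filter function, as in the docs. A function passed to class_ is checked against each class name separately, so it cannot tell "z" from "a z"; a function that receives the whole tag can:

def is_only_z(tag):
    # True only for <li> tags whose class list is exactly ['z']
    return tag.name == 'li' and tag.get('class') == ['z']

# bs4 here is the parsed document object from the question
bs4.find_all(is_only_z)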
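
As a rough sketch of the difference between the two kinds of filter function, the snippet below compares a per-class filter with a per-tag filter. The helper names class_is_z and tag_has_only_z and the sample markup are hypothetical, and the per-class behaviour shown is what I understand Beautiful Soup to do for multi-valued attributes, so treat it as an assumption to verify against your installed version.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the asker's real page.
html = '<li class="a z">a and z</li><li class="z">only z</li>'
soup = BeautifulSoup(html, "html.parser")

def class_is_z(css_class):
    # Hypothetical helper: receives class values one at a time,
    # so it is true for any tag that has "z" among its classes.
    return css_class == "z"

def tag_has_only_z(tag):
    # Hypothetical helper: receives the whole tag, so it can
    # compare the complete class list.
    return tag.name == "li" and tag.get("class") == ["z"]

per_class = soup.find_all("li", class_=class_is_z)
per_tag = soup.find_all(tag_has_only_z)

print(len(per_class), len(per_tag))
```

The per-tag version is the one that isolates the lone "z" entry; the per-class version behaves like class_='z'.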

Upvotes: 0
