Shane

Reputation: 467

Searching for pieces of an attribute with Beautiful Soup

I want to use Beautiful Soup to pull out anything with the following format:

<div class="dog-a b-cat">

I can get a particular instance if I know what "a" and "b" are by doing the following (suppose a=aardvark and b=boy):

soup.find_all("div",class_="dog-aardvark boy-cat")

Is there any way to pull out all instances, regardless of the two words between the dashes, where the class value starts with dog- and ends with -cat?

Upvotes: 1

Views: 139

Answers (2)

alecxe

Reputation: 473873

@bourbaki4481472 is on the right track in general, but the proposed solution would not work for several reasons. First, class is a special multi-valued attribute, so the regular expression would be matched against a single class value at a time rather than against the whole attribute string. Second, as originally posted, the snippet was not syntactically valid Python.

I suggest writing a filtering function that checks that the first class value starts with dog- and the second one ends with -cat. If needed, you can tighten it by also checking the tag name or how many class values are present:

def class_filter(elm):
    try:
        classes = elm["class"]
        return classes[0].startswith("dog-") and classes[1].endswith("-cat")
    except (KeyError, IndexError, TypeError):
        return False

Complete example:

from bs4 import BeautifulSoup

data = """
<div class="dog-test test-cat">test1</div>
<div class="dog-test">test2</div>
<div class="test-cat">test3</div>
<div class="dog">test4</div>
<div class="cat">test5</div>
<div class="irrelevant">test6</div>
"""

# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning.
soup = BeautifulSoup(data, "html.parser")

def class_filter(elm):
    try:
        classes = elm["class"]
        return classes[0].startswith("dog-") and classes[1].endswith("-cat")
    except (KeyError, IndexError, TypeError):
        return False

for elm in soup.find_all(class_filter):
    print(elm.text)

Prints test1 only.
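
The optional tightening mentioned above (checking the tag name and the number of class values) could look roughly like the sketch below. The name strict_class_filter and the sample markup are made up for illustration and are not part of the original answer; it assumes you only want div tags with exactly two class values.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, chosen to exercise each extra check.
html = """
<div class="dog-test test-cat">test1</div>
<span class="dog-test test-cat">test2</span>
<div class="dog-test extra test-cat">test3</div>
<div class="dog-test">test4</div>
"""

def strict_class_filter(elm):
    # Tags without a class attribute yield an empty list here.
    classes = elm.get("class") or []
    return (
        elm.name == "div"            # only <div> tags
        and len(classes) == 2        # exactly two class values
        and classes[0].startswith("dog-")
        and classes[1].endswith("-cat")
    )

soup = BeautifulSoup(html, "html.parser")
strict_matches = [elm.text for elm in soup.find_all(strict_class_filter)]
print(strict_matches)
```

With this sample, only test1 passes: the span fails the tag-name check, and the other two div tags have the wrong number of class values.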

Upvotes: 2

under_the_sea_salad

Reputation: 1824

Try using regular expressions to generalize your parameters.

import re
# class_ (with a trailing underscore) is needed because class is a reserved word in Python
soup.find_all("div", class_=re.compile(r"dog-.+ boy-.+"))

The pattern above matches dog- followed by one or more characters, then a space, then boy- followed by one or more characters.
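
If you want to keep the regular-expression idea without depending on how Beautiful Soup matches multi-valued attributes, one option is to join the class values yourself and test the whole string inside a function filter. This is only a sketch: the helper name class_matches_pattern, the sample markup, and the pattern dog-\S+ \S+-cat (built from the format in the question rather than the boy- pattern above) are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = """
<div class="dog-aardvark boy-cat">one</div>
<div class="dog-bird girl-cat">two</div>
<div class="dog-aardvark">three</div>
<div class="cat-aardvark boy-dog">four</div>
"""

# Whole class string must be "dog-<word> <word>-cat".
pattern = re.compile(r"dog-\S+ \S+-cat")

def class_matches_pattern(tag):
    # bs4 exposes class as a list of values; rejoin it with single spaces
    # so the regex sees the attribute as one string.
    classes = tag.get("class") or []
    return tag.name == "div" and pattern.fullmatch(" ".join(classes)) is not None

soup = BeautifulSoup(html, "html.parser")
regex_matches = [tag.text for tag in soup.find_all(class_matches_pattern)]
print(regex_matches)
```

Using fullmatch rather than search keeps elements with extra leading or trailing class values from matching by accident.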

Upvotes: 0
