Reputation: 467
I want to use Beautiful Soup to pull out anything with the following format:
div class="dog-a b-cat"
I can get a particular instance if I know what "a" and "b" are by doing the following (suppose a=aardvark and b=boy):
soup.find_all("div",class_="dog-aardvark boy-cat")
Is there any way I can pull out all instances (regardless of the two words between the dashes) with dog and cat and two dashes in between?
Upvotes: 1
Views: 139
Reputation: 473873
@bourbaki4481472 is on the right track in general, but the proposed solution would not work for several reasons: the specified regular expression would be matched against a single class value at a time, since class is a special multi-valued attribute, and the code as posted is simply syntactically incorrect.
I suggest writing a filtering function that checks that the first class value starts with dog- and the second one ends with -cat. You can improve it by additionally checking the tag name or how many class values are present, if needed:
def class_filter(elm):
    try:
        classes = elm["class"]
        return classes[0].startswith("dog-") and classes[1].endswith("-cat")
    except (KeyError, IndexError, TypeError):
        return False
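The stricter checks mentioned above (tag name and number of class values) could be sketched roughly as follows. The name strict_class_filter and the sample markup are made up for illustration, and requiring exactly two classes is an assumption about what "dog-a b-cat" should mean:

```python
from bs4 import BeautifulSoup


def strict_class_filter(elm):
    """Accept only <div> tags with exactly two classes: 'dog-*' then '*-cat'."""
    if elm.name != "div":
        return False
    # Tag.get() returns the list of class values, or the default if absent.
    classes = elm.get("class", [])
    return (
        len(classes) == 2
        and classes[0].startswith("dog-")
        and classes[1].endswith("-cat")
    )


# Hypothetical sample markup, not taken from the question.
html = """
<div class="dog-aardvark boy-cat">kept</div>
<span class="dog-aardvark boy-cat">wrong tag</span>
<div class="dog-aardvark boy-cat extra">three classes</div>
"""
soup = BeautifulSoup(html, "html.parser")
matches = [elm.get_text() for elm in soup.find_all(strict_class_filter)]
print(matches)
```

Only the first div passes all three checks here. Drop the len(classes) == 2 condition if extra classes after the second one should be tolerated.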
Complete example:
from bs4 import BeautifulSoup
data = """
<div class="dog-test test-cat">test1</div>
<div class="dog-test">test2</div>
<div class="test-cat">test3</div>
<div class="dog">test4</div>
<div class="cat">test5</div>
<div class="irrelevant">test6</div>
"""
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(data, "html.parser")

def class_filter(elm):
    try:
        classes = elm["class"]
        return classes[0].startswith("dog-") and classes[1].endswith("-cat")
    except (KeyError, IndexError, TypeError):
        return False

for elm in soup.find_all(class_filter):
    print(elm.text)
Prints test1 only.
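As a different technique from the filter function, CSS attribute selectors via soup.select() may also fit this case. This is only a sketch: it assumes a Beautiful Soup version that bundles the soupsieve selector engine, and it tests the whole class attribute string rather than individual class values, so it is looser than the function above (a single class such as dog-cat, or more than two classes, would also pass).

```python
from bs4 import BeautifulSoup

html = """
<div class="dog-test test-cat">test1</div>
<div class="dog-test">test2</div>
<div class="test-cat">test3</div>
<div class="irrelevant">test6</div>
"""
soup = BeautifulSoup(html, "html.parser")

# [class^="dog-"]  -> the class attribute string starts with "dog-"
# [class$="-cat"]  -> the class attribute string ends with "-cat"
selected = soup.select('div[class^="dog-"][class$="-cat"]')
texts = [elm.get_text() for elm in selected]
print(texts)
```

If the exact position and count of the class values matter, the filter function remains the more precise option.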
Upvotes: 2
Reputation: 1824
Try using regular expressions to generalize your parameters.
import re
soup.find_all("div", class= re.compile(r"dog-.+ boy-.+")
The above would look for the string dog- followed by one or more characters, followed by a space, and then boy- followed by one or more characters.
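To see what that pattern accepts on its own, independent of Beautiful Soup, here is a small sketch using only the re module. The sample strings are invented for illustration:

```python
import re

# Same pattern as above, applied to plain strings.
pattern = re.compile(r"dog-.+ boy-.+")

# Hypothetical class attribute strings.
samples = [
    "dog-aardvark boy-cat",   # matches: "dog-", text, space, "boy-", text
    "dog-aardvark girl-cat",  # no match: the second word does not start with "boy-"
    "dog-aardvark",           # no match: there is no second word at all
]

results = {s: bool(pattern.search(s)) for s in samples}
for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Note that the pattern hard-codes boy- as the start of the second word, so it does not cover arbitrary values of "b".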
Upvotes: 0