
Reputation: 43

BeautifulSoup search attributes-value

I'm trying to search in HTML documents for specific attribute values. e.g.

<html> 
  <h2 itemprop="prio1">  TEXT PRIO 1 </h2>
  <span id="prio2"> TEXT PRIO 2 </span>
</html>

I want to find all elements that have an attribute value beginning with "prio".

I know that I can do something like:

soup.find_all(itemprop=re.compile('prio.*'))

Or

soup.find_all(id=re.compile('prio.*'))

But what I am looking for is something like:

soup.find_all(*=re.compile('prio.*'))

Upvotes: 4

Views: 1064

Answers (2)

Padraic Cunningham

Reputation: 180441

First off, your regex is wrong: to match only strings that start with prio, you would anchor the pattern with ^. As written, it matches prio anywhere in the string. If you are going to check every attribute, you should just use str.startswith:

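from bs4 import BeautifulSoup

h = """<html>
  <h2 itemprop="prio1">  TEXT PRIO 1 </h2>
  <span id="prio2"> TEXT PRIO 2 </span>
</html>"""

soup = BeautifulSoup(h, "lxml")

# Keep every tag that has at least one attribute value starting with "prio".
# This assumes every value is a string, which holds for the sample document.
tags = soup.find_all(lambda t: any(a.startswith("prio") for a in t.attrs.values()))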
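One caveat with the lambda above: BeautifulSoup returns multi-valued attributes such as class as a list of strings, and calling startswith on a list raises AttributeError. A minimal sketch of a variant that tolerates both cases follows. The helper name has_value_with_prefix and the extra p elements are made up for illustration, and the stdlib "html.parser" builder is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup


def has_value_with_prefix(tag, prefix):
    """Return True if any attribute value of `tag` starts with `prefix`."""
    for value in tag.attrs.values():
        # Multi-valued attributes (e.g. class) come back as a list of strings.
        candidates = value if isinstance(value, list) else [value]
        if any(isinstance(v, str) and v.startswith(prefix) for v in candidates):
            return True
    return False


doc = """<html>
  <h2 itemprop="prio1"> TEXT PRIO 1 </h2>
  <span id="prio2"> TEXT PRIO 2 </span>
  <p class="note prio3"> TEXT PRIO 3 </p>
  <p class="note"> OTHER </p>
</html>"""

soup = BeautifulSoup(doc, "html.parser")
tags = soup.find_all(lambda t: has_value_with_prefix(t, "prio"))
print([t.name for t in tags])  # ['h2', 'span', 'p']
```

The second p element is skipped because neither of its class tokens starts with "prio".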

If you just want to check for certain attributes:

tags = soup.find_all(lambda t: t.get("id","").startswith("prio") or t.get("itemprop","").startswith("prio"))

But if you want a more efficient solution, you might want to look at lxml, whose XPath support lets you use attribute wildcards:

from lxml import html

xml = html.fromstring(h)

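# Test each attribute separately; starts-with(@*, 'prio') would only look at the first attribute.
tags = xml.xpath("//*[@*[starts-with(., 'prio')]]")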
print(tags)

Or just id and itemprop:

tags = xml.xpath("//*[starts-with(@id,'prio') or starts-with(@itemprop, 'prio')]")
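A note on the wildcard form: in XPath 1.0, passing a node-set such as @* to starts-with() converts only its first node to a string, so an element whose matching attribute is not the first one can be missed. A small sketch comparing the two predicates, using a made-up document in which the h2 carries an extra class attribute before itemprop:

```python
from lxml import html

doc = """<html><body>
  <h2 class="title" itemprop="prio1"> TEXT PRIO 1 </h2>
  <span id="prio2"> TEXT PRIO 2 </span>
  <p id="other"> OTHER </p>
</body></html>"""

tree = html.fromstring(doc)

# Node-set argument: only the first attribute of each element is inspected.
first_only = tree.xpath("//*[starts-with(@*, 'prio')]")

# Inner predicate: evaluated once for every attribute of the element.
any_attr = tree.xpath("//*[@*[starts-with(., 'prio')]]")

print([el.tag for el in first_only])
print([el.tag for el in any_attr])
```

The second query is the one that answers "any attribute value starting with prio", so it should list both the h2 and the span.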

Upvotes: 1

Remi Guan

Reputation: 22282

I don't know if this is the best way, but this works:

>>> soup.find_all(lambda element: any(re.search('prio.*', attr) for attr in element.attrs.values()))
[<h2 itemprop="prio1">  TEXT PRIO 1 </h2>, <span id="prio2"> TEXT PRIO 2 </span>]

Here, find_all calls the lambda once for each element. The lambda runs re.search with 'prio.*' on every value in element.attrs.values(), and any() returns True as soon as one of those values matches.

Note that re.search finds 'prio' anywhere in the value. To require that the value starts with 'prio', use re.match or anchor the pattern as '^prio'.


You can also use str.startswith instead of a regex, since you are only checking whether an attribute value starts with 'prio':

soup.find_all(lambda element: any(attr.startswith('prio') for attr in element.attrs.values()))
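To see how the unanchored regex differs from a prefix check, here is a short stdlib-only comparison. The sample values are made up for illustration.

```python
import re

values = ["prio1", "low-prio", "PRIO2"]
pattern = re.compile("prio.*")

# search() scans the whole string, so "low-prio" also matches.
found_anywhere = [v for v in values if pattern.search(v)]

# match() only succeeds at the start of the string.
found_at_start = [v for v in values if pattern.match(v)]

# str.startswith gives the same result as match() without a regex.
found_by_prefix = [v for v in values if v.startswith("prio")]

print(found_anywhere)   # ['prio1', 'low-prio']
print(found_at_start)   # ['prio1']
print(found_by_prefix)  # ['prio1']
```

All three checks are case-sensitive, which is why "PRIO2" is never returned.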

Upvotes: 0
