Reputation: 43
I'm trying to search HTML documents for specific attribute values, e.g.
<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>
I want to find all elements with attribute values beginning with "prio".
I know that I can do something like:
soup.find_all(itemprop=re.compile('prio.*'))
Or
soup.find_all(id=re.compile('prio.*'))
But what I am looking for is something like:
soup.find_all(*=re.compile('prio.*'))
Upvotes: 4
Views: 1064
Reputation: 180441
First off, your regex is wrong: if you wanted to find only strings starting with prio, you would prefix the pattern with ^. As it is, your regex matches prio anywhere in the string. If you are going to search every attribute, you should just use str.startswith:
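from bs4 import BeautifulSoup

h = """<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
</html>"""
soup = BeautifulSoup(h, "lxml")
# Keep any tag that has at least one attribute value starting with "prio".
# Assumes string values; multi-valued attributes such as class are returned as lists.
tags = soup.find_all(lambda t: any(a.startswith("prio") for a in t.attrs.values()))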
If you just want to check for certain attributes:
tags = soup.find_all(lambda t: t.get("id","").startswith("prio") or t.get("itemprop","").startswith("prio"))
But if you want a more efficient solution, you might want to look at lxml, which lets you use wildcards in XPath:
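from lxml import html
xml = html.fromstring(h)
# starts-with(@*, 'prio') would test only the first attribute of each element,
# so apply the test to every attribute node instead.
tags = xml.xpath("//*[@*[starts-with(., 'prio')]]")
print(tags)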
Or just id and itemprop:
tags = xml.xpath("//*[starts-with(@id,'prio') or starts-with(@itemprop, 'prio')]")
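The point about anchoring made at the start of this answer can be checked with the standard library alone. This is only a small sketch; the sample values in `values` are made up for illustration and are not from the question.

```python
import re

# Hypothetical attribute values, chosen to show the difference.
values = ["prio1", "low-prio", "PRIO3", ""]

# Unanchored pattern with re.search: matches "prio" anywhere in the string.
unanchored = [v for v in values if re.search("prio.*", v)]

# Anchored pattern: matches only at the start of the string.
anchored = [v for v in values if re.search("^prio", v)]

# re.match always starts matching at the beginning, so no ^ is needed.
with_match = [v for v in values if re.match("prio", v)]

# Plain string method, no regex involved.
with_startswith = [v for v in values if v.startswith("prio")]

print(unanchored)
print(anchored)
```

The unanchored pattern also accepts "low-prio", while the other three variants agree with each other, which is why str.startswith is the simpler choice here.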
Upvotes: 1
Reputation: 22282
I don't know if this is the best way, but this works:
>>> soup.find_all(lambda element: any(re.search('prio.*', attr) for attr in element.attrs.values()))
[<h2 itemprop="prio1"> TEXT PRIO 1 </h2>, <span id="prio2"> TEXT PRIO 2 </span>]
Here, find_all calls the lambda with each element. For every value in element.attrs.values(), re.search looks for 'prio.*', and any() returns True if at least one attribute value matches. Note that re.search finds 'prio' anywhere in the value; to match only values that start with it, use '^prio' or re.match.
You can also use str.startswith here instead of a regex, since you're only checking whether an attribute value starts with 'prio', like below:
soup.find_all(lambda element: any(attr.startswith('prio') for attr in element.attrs.values()))
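One caveat with the lambda approaches: Beautiful Soup returns multi-valued attributes such as class as a list of strings, so calling startswith directly on the value fails for those tags. A minimal sketch of a more defensive predicate follows; the helper name `has_attr_starting_with` and the extra `<p>` and `<div>` lines in the sample are hypothetical additions for illustration, and the built-in "html.parser" is used only so that lxml is not required.

```python
from bs4 import BeautifulSoup


def has_attr_starting_with(tag, prefix):
    """Return True if any attribute value of tag starts with prefix."""
    for value in tag.attrs.values():
        # Multi-valued attributes (e.g. class) come back as a list of strings.
        candidates = value if isinstance(value, list) else [value]
        if any(v.startswith(prefix) for v in candidates):
            return True
    return False


html_doc = """<html>
<h2 itemprop="prio1"> TEXT PRIO 1 </h2>
<span id="prio2"> TEXT PRIO 2 </span>
<p class="note prio3"> TEXT PRIO 3 </p>
<div id="other"> NO MATCH </div>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")
matches = soup.find_all(lambda tag: has_attr_starting_with(tag, "prio"))
names = [tag.name for tag in matches]
print(names)
```

With this predicate the `<p>` tag is matched through the second of its class values, and the `<div>` is skipped.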
Upvotes: 0