Reputation: 4255
I am trying to use bs4 with a regex to match the value* attribute names in the list below:
all_attributes = [
<s3><Cell value1="384.01"/></s3>,
<s3><Cell value2="447.82"/></s3>,
<s3><Cell value3="72.83"/></s3>,
<s3><Cell value4="325.65"/></s3>,
<s3><Cell value4="49.34"/></s3>,
<s3><Cell Textbox4="filler"/></s3>,
<s3><Cell Textbox4="filler"/></s3>
]
The expression I use is:
[attribute.find_all("Cell", {re.compile(r'^value[0-9]$'): True})
for attribute in all_attributes]
with the hope that I will get the following:
[
<s3><Cell value1="384.01"/></s3>,
<s3><Cell value2="447.82"/></s3>,
<s3><Cell value3="72.83"/></s3>,
<s3><Cell value4="325.65"/></s3>,
<s3><Cell value4="49.34"/></s3>
]
However, I get an empty list.
If I substitute re.compile(r'^value[0-9]$') with value1 or value2 etc., it works as expected, but obviously that does not deliver what I want.
I guess using re.compile as a dict key does not work, but I don't quite understand why.
Can someone please explain and help find a solution for this use case?
Upvotes: 1
Views: 71
Reputation: 195468
You can pass a lambda function to .find_all and check the attribute names with str.startswith:
from bs4 import BeautifulSoup
html_doc = """
<s3><Cell value1="384.01"/></s3>,
<s3><Cell value2="447.82"/></s3>,
<s3><Cell value3="72.83"/></s3>,
<s3><Cell value4="325.65"/></s3>,
<s3><Cell value4="49.34"/></s3>,
<s3><Cell Textbox4="filler"/></s3>,
<s3><Cell Textbox4="filler"/></s3>
"""
soup = BeautifulSoup(html_doc, "html.parser")
x = soup.find_all(
    # html.parser lowercases tag names, so compare against "cell", not "Cell"
    lambda tag: tag.name == "cell"
    and any(a.startswith("value") for a in tag.attrs)
)
print(x)
Prints:
[<cell value1="384.01"></cell>,
<cell value2="447.82"></cell>,
<cell value3="72.83"></cell>,
<cell value4="325.65"></cell>,
<cell value4="49.34"></cell>]
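One caveat: str.startswith("value") is looser than the pattern in the question, since it accepts any attribute name that begins with "value". A minimal sketch of the difference on plain strings (the attribute names below are made up for illustration and are not from the question's data):

```python
import re

# Hypothetical attribute names, chosen only to show where the two checks differ
attr_names = ["value1", "value4", "values", "value12", "Textbox4"]

# Prefix check: everything that starts with "value" passes
prefix_hits = [n for n in attr_names if n.startswith("value")]

# The question's pattern: "value" followed by exactly one digit
pattern = re.compile(r"^value[0-9]$")
regex_hits = [n for n in attr_names if pattern.search(n)]

print(prefix_hits)  # ['value1', 'value4', 'values', 'value12']
print(regex_hits)   # ['value1', 'value4']
```

If the data only ever contains value1, value2, ... the prefix check is enough.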
Or using regex:
import re
r = re.compile(r"^value\d+")
x = soup.find_all(
    lambda tag: tag.name == "cell" and any(r.search(a) for a in tag.attrs)
)
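As for why the dict key did not work: as far as I can tell, Beautiful Soup looks up the keys of the attrs dict as literal attribute names and only applies filters (strings, regexes, True, functions) to the attribute values, so a compiled pattern used as a key never matches anything. Below is a sketch of the same idea applied to a list of separate elements, as in the question. The helper name has_value_attr and the way the list is built are assumptions for illustration; only a subset of the question's data is used.

```python
import re
from bs4 import BeautifulSoup

# Same pattern as in the question; fullmatch() makes ^ and $ unnecessary
VALUE_ATTR = re.compile(r"value[0-9]")

def has_value_attr(tag):
    """Return True for <cell> tags with an attribute named value0..value9."""
    # html.parser lowercases names; with an XML parser compare against "Cell"
    return tag.name == "cell" and any(
        VALUE_ATTR.fullmatch(name) for name in tag.attrs
    )

snippets = [
    '<s3><Cell value1="384.01"/></s3>',
    '<s3><Cell value4="49.34"/></s3>',
    '<s3><Cell Textbox4="filler"/></s3>',
]
# Hypothetical stand-in for the question's all_attributes: one <s3> tag per entry
all_attributes = [BeautifulSoup(s, "html.parser").s3 for s in snippets]

# Keep only the <s3> elements that contain a matching <cell>
matching = [el for el in all_attributes if el.find(has_value_attr)]
print(matching)
```

This keeps the whole <s3> element rather than the inner cell, which is the shape the question asked for.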
Upvotes: 1