Newskooler

Reputation: 4255

Is it possible to use bs4's find_all with a regex in the attribute dictionary key?

I am trying to use bs4 to match attribute names with a regex, here the value* attributes below:

all_attributes = [
 <s3><Cell value1="384.01"/></s3>,
 <s3><Cell value2="447.82"/></s3>,
 <s3><Cell value3="72.83"/></s3>,
 <s3><Cell value4="325.65"/></s3>,
 <s3><Cell value4="49.34"/></s3>,
 <s3><Cell Textbox4="filler"/></s3>,
 <s3><Cell Textbox4="filler"/></s3>
]

The expression I use is:

[attribute.find_all("Cell", {re.compile(r'^value[0-9]$'): True})
for attribute in all_attributes]

with the hope that I will get the following:

[
 <s3><Cell value1="384.01"/></s3>,
 <s3><Cell value2="447.82"/></s3>,
 <s3><Cell value3="72.83"/></s3>,
 <s3><Cell value4="325.65"/></s3>,
 <s3><Cell value4="49.34"/></s3>
]

However, I get an empty list. If I replace re.compile(r'^value[0-9]$') with a literal name such as value1 or value2, it works as expected, but that obviously does not deliver what I want.

I guess a compiled regex does not work as a dict key, but I don't quite understand why.

Can someone please explain and suggest a solution for this use case?

Upvotes: 1

Views: 71

Answers (1)

Andrej Kesely

Reputation: 195468

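find_all looks up the keys of the attribute dictionary as literal attribute names; only the values can be regexes, so a compiled pattern used as a key never matches. You can instead pass a lambda function to .find_all and test the attribute names with str.startswith (note that html.parser lowercases tag names, hence "cell"):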

from bs4 import BeautifulSoup

html_doc = """
 <s3><Cell value1="384.01"/></s3>,
 <s3><Cell value2="447.82"/></s3>,
 <s3><Cell value3="72.83"/></s3>,
 <s3><Cell value4="325.65"/></s3>,
 <s3><Cell value4="49.34"/></s3>,
 <s3><Cell Textbox4="filler"/></s3>,
 <s3><Cell Textbox4="filler"/></s3>
"""

soup = BeautifulSoup(html_doc, "html.parser")

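# Keep <cell> tags that have at least one attribute whose name starts with "value".
# Iterating over tag.attrs yields the attribute names (the dict keys).
x = soup.find_all(
    lambda tag: tag.name == "cell"
    and any(a.startswith("value") for a in tag.attrs)
)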
print(x)

Prints:

[<cell value1="384.01"></cell>, 
 <cell value2="447.82"></cell>, 
 <cell value3="72.83"></cell>, 
 <cell value4="325.65"></cell>, 
 <cell value4="49.34"></cell>]

Or using regex:

import re

# Matches attribute names that start with "value" followed by at least one digit
r = re.compile(r"^value\d+")
x = soup.find_all(
    lambda tag: tag.name == "cell" and any(r.search(a) for a in tag.attrs)
)
print(x)
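
If you need this in several places, the lambda can be wrapped in a small helper. The following is only a sketch: find_by_attr_name is a hypothetical name, not part of bs4, and the value12 row is made-up sample data added to show the difference between matching a prefix and matching the whole attribute name. It uses fullmatch so that only names like value1 pass, which is closer to the ^value[0-9]$ pattern in the question. The last call shows the form bs4 does support in the attribute dictionary: a regex as the value, with a literal name as the key.

```python
import re
from bs4 import BeautifulSoup


def find_by_attr_name(soup, tag_name, pattern):
    """Hypothetical helper: return tags called tag_name that have at least
    one attribute whose entire name matches the regex pattern."""
    regex = re.compile(pattern)
    return soup.find_all(
        lambda tag: tag.name == tag_name
        and any(regex.fullmatch(name) for name in tag.attrs)
    )


# Sample data; the value12 row is invented for illustration.
html_doc = """
<s3><Cell value1="384.01"/></s3>
<s3><Cell value12="447.82"/></s3>
<s3><Cell Textbox4="filler"/></s3>
"""

# html.parser lowercases tag and attribute names, so search for "cell".
soup = BeautifulSoup(html_doc, "html.parser")

# Only value1 matches: "value12" is not a full match for value[0-9].
matches = find_by_attr_name(soup, "cell", r"value[0-9]")
print(matches)

# A regex is accepted as the attribute *value*; the key stays a plain string.
by_value = soup.find_all("cell", {"value1": re.compile(r"^\d+\.\d+$")})
print(by_value)
```

Swap fullmatch for search if you want the looser "starts with" behaviour of the snippet above.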

Upvotes: 1
