nutship

Reputation: 4924

BeautifulSoup, simple regex issue

I just hit a snag with a regex and have no idea why it isn't working.

Here is what the BeautifulSoup documentation says:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

Here is my HTML:

<a href="exam.com" title="Keeper: Jay" class="pos_text">Aouate</a></span><span class="pos_text pos3_l_4">

and I'm trying to match the span tag (last position).

>>> if soup.find(class_=re.compile("pos_text pos3_l_\d{1}")):
        print "Yes"

# prints nothing - indicating there is no such pattern in the html

So I'm just following the BS4 docs, except my regex is not working. Sure enough, if I replace the \d{1} with 4 (as in the original HTML), it succeeds.

Upvotes: 3

Views: 165

Answers (3)

PuercoPop

Reputation: 6807

You are not matching a single class but a specific combination of classes in a specific order.

From the documentation:

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

But searching for variants of the string value won’t work:

css_soup.find_all("p", class_="strikeout body")
# []

So you should probably first match on pos_text and then apply a regex to the class values of the tags that search returns.
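A minimal sketch of that two-step approach, assuming a tidied-up version of the HTML fragment from the question (the closing tags are added here so it parses cleanly) and the standard-library "html.parser" backend. The names pos_class and matches are only illustrative. BeautifulSoup exposes class as a list of individual values, so the regex is applied to each class on its own rather than to the whole attribute string:

```python
import re
from bs4 import BeautifulSoup

# Assumed, cleaned-up version of the fragment from the question.
html = (
    '<span><a href="exam.com" title="Keeper: Jay" class="pos_text">Aouate</a></span>'
    '<span class="pos_text pos3_l_4"></span>'
)
soup = BeautifulSoup(html, "html.parser")

# Pattern for one individual class value, e.g. "pos3_l_4".
pos_class = re.compile(r"^pos3_l_\d$")

# Step 1: find every tag that carries the pos_text class.
# Step 2: keep only those that also have a class matching the pattern.
matches = [
    tag
    for tag in soup.find_all(class_="pos_text")
    if any(pos_class.match(value) for value in tag.get("class", []))
]

print(matches)
```

Matching each class separately also means the order of the classes in the attribute no longer matters.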

Upvotes: 1

000

Reputation: 27227

I'm not entirely sure, but this worked for me:

soup.find(attrs={'class':re.compile('pos_text pos3_l_\d{1}')})

Upvotes: 2

Chris Doggett

Reputation: 20747

Try "\\d" in your regex. It's probably interpreting "\d" as trying to escape 'd'.

Alternatively, a raw string ought to work. Just put an 'r' in front of the regex, like this:

re.compile(r"pos_text pos3_l_\d{1}")
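A quick way to check both spellings with only the standard library; this is a sketch, and the variable names are just for illustration. It compares the doubled-backslash string with the raw string, then runs the pattern against the class attribute value from the question as one plain string:

```python
import re

# The doubled-backslash spelling and the raw-string spelling.
escaped = "pos_text pos3_l_\\d{1}"
raw = r"pos_text pos3_l_\d{1}"
same_pattern = escaped == raw

# The class attribute value from the question, taken as a single string.
class_value = "pos_text pos3_l_4"
found = re.search(raw, class_value) is not None

print(same_pattern, found)  # True True
```

If both checks pass, the pattern itself is fine against the full attribute string, so any remaining failure is more likely down to how BeautifulSoup compares class values than to the escaping.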

Upvotes: 2
