sandepp

Reputation: 532

Using a regular expression in find_all of BeautifulSoup

I am trying to scrape a Tumblr archive. The class attribute of the div looks like the one in the picture.


The class starts with "post post_micro". I tried using a regular expression, but it failed:

soup.find_all(class_=re.compile('^post post_micro') 

I then tried passing a function to find_all for the class:

def func(x):                 
    if str(x).startswith('post_tumblelog'):
        return True

and used it as:

soup.find_all(class_=func)

The above works fine and I get what I need. But I want to know how to do it with a regular expression, and why, in func(x),

str(x).startswith('post_tumblelog')

evaluates as True when the class name is starting with "post post_micro".

Upvotes: 1

Views: 4229

Answers (1)

Josh Crozier

Reputation: 240948

In BeautifulSoup 4, you can use the .select() method, since it accepts CSS attribute selectors. In your case, you would use the attribute selector [class^="post_tumblelog"], which selects elements whose class attribute starts with the string post_tumblelog.

soup.select('[class^="post_tumblelog"]')

Alternatively, you could use:

soup.find_all(class_=lambda x: x and x.startswith('post_tumblelog'))

As a side note, it looks like you were missing a closing parenthesis. The following works:

soup.find_all(class_=re.compile('^post_tumblelog'))  
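As for why func sees 'post_tumblelog': BeautifulSoup treats class as a multi-valued attribute, so a function or compiled pattern passed as class_ is tested against each class name on its own, not only against the full attribute string. Below is a rough, stdlib-only sketch of that idea. It is not BeautifulSoup's actual implementation, and both the helper name matches_any_class and the class string are made up for illustration.

```python
import re


def matches_any_class(class_attr, predicate):
    """Return True if any whitespace-separated class name satisfies predicate.

    Hypothetical helper: mimics per-class matching, not bs4's real code.
    """
    return any(predicate(name) for name in class_attr.split())


# Hypothetical class attribute, shaped like the one described in the question.
class_attr = "post post_micro post_tumblelog_example"


def starts_with(name):
    # Same check as func(x) in the question, applied to one class name.
    return name.startswith("post_tumblelog")


pattern = re.compile(r"^post_tumblelog")

# Per-class checks: one of the three names starts with "post_tumblelog".
print(matches_any_class(class_attr, starts_with))
print(bool(matches_any_class(class_attr, pattern.search)))

# Whole-string check: the attribute as a whole starts with "post", not "post_tumblelog".
print(class_attr.startswith("post_tumblelog"))
```

The last line is the whole-string comparison, which is how a CSS prefix selector such as [class^="post_tumblelog"] is defined. If the attribute really begins with "post post_micro", a substring selector like [class*="post_tumblelog"] may be closer to what you need with .select().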

Upvotes: 4
