Using a regular expression in find_all of BeautifulSoup

Question

I was trying to scrape a Tumblr archive. The div's class attribute looks like the one shown in the picture.

The class starts with "post post_micro". I tried using a regular expression, but it failed:

soup.find_all(class_=re.compile('^post post_micro')

I then tried passing a function to find_all for the class:

def func(x):                 
    if str(x).startswith('post_tumblelog'):
        return True

and used it as:

soup.find_all(class_=func)

This works and gives me what I need. But I would like to know how to do it with a regular expression, and why, inside func(x),

str(x).startswith('post_tumblelog')

evaluates to True when the class name starts with "post post_micro".

Josh Crozier · Accepted Answer

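In BeautifulSoup 4, you can use the .select() method, which accepts CSS attribute selectors. In your case, you would use the attribute selector [class^="post_tumblelog"], which selects elements whose class attribute starts with the string post_tumblelog. Note that a CSS attribute selector tests the whole attribute value, so this only matches when post_tumblelog is at the very start of the class attribute.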

soup.select('[class^="post_tumblelog"]')
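A CSS attribute selector compares against the full class string rather than each class separately, so the choice between `^=` (prefix) and `*=` (substring) matters. The sketch below uses made-up markup: the picture from the question is not available here, so class names such as `post_tumblelog_abc123` are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real Tumblr class names are assumed, not known.
html = """
<div class="post post_micro post_tumblelog_abc123">first</div>
<div class="post post_micro post_tumblelog_def456">second</div>
<div class="sidebar">other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ^= compares against the start of the full attribute value.
starts_with_post = soup.select('[class^="post post_micro"]')
starts_with_tumblelog = soup.select('[class^="post_tumblelog"]')

# *= matches the substring anywhere in the attribute value.
contains_tumblelog = soup.select('[class*="post_tumblelog"]')

print(len(starts_with_post), len(starts_with_tumblelog), len(contains_tumblelog))
```

With this sample markup, the `^="post_tumblelog"` selector finds nothing because the attribute begins with "post", while the `*=` form finds both posts.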

Alternatively, you could use:

soup.find_all(class_=lambda x: x and x.startswith('post_tumblelog'))
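This also bears on the "why" part of the question: class is a multi-valued attribute, so BeautifulSoup hands the filter each class name on its own, and "post_tumblelog..." is tested separately from "post" and "post_micro". Here is a minimal sketch that records what the filter receives. The markup and the names `starts_with_tumblelog` and `seen` are hypothetical, and the `is not None` guard plays the same role as the `x and` above, since the filter may be handed None for tags without a class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the Tumblr archive page.
html = """
<div class="post post_micro post_tumblelog_abc123">first</div>
<div class="post post_micro post_tumblelog_def456">second</div>
<div class="sidebar">other</div>
"""
soup = BeautifulSoup(html, "html.parser")

seen = []  # every value BeautifulSoup passes to the filter

def starts_with_tumblelog(value):
    seen.append(value)
    # Guard against None before calling a string method.
    return value is not None and value.startswith("post_tumblelog")

matches = soup.find_all(class_=starts_with_tumblelog)

print([tag.get_text() for tag in matches])
print(seen)
```

The recorded values include the individual class names, which is why a check for the prefix "post_tumblelog" succeeds even though the attribute as a whole starts with "post post_micro".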

As a side note, it looks like you were missing a closing parenthesis. The following works:

soup.find_all(class_=re.compile('^post_tumblelog'))
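For a compiled pattern, BeautifulSoup filters with the pattern's search() method, so the `^` anchor is what turns this into a "starts with" test. A small sketch of the difference, again on assumed markup (the class names are invented for the example):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the second div has a class that merely contains
# "post_tumblelog" rather than starting with it.
html = """
<div class="post post_micro post_tumblelog_abc123">first</div>
<div class="widget my_post_tumblelog_box">second</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Anchored: some class must start with "post_tumblelog".
anchored = soup.find_all(class_=re.compile(r"^post_tumblelog"))

# Unanchored: search() finds the text anywhere inside a class name.
unanchored = soup.find_all(class_=re.compile(r"post_tumblelog"))

print([tag.get_text() for tag in anchored])
print([tag.get_text() for tag in unanchored])
```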
