Python regular expression for Beautiful Soup

Question

I am using Beautiful Soup to pull out specific div tags, and it seems I can't use simple string matching.

The page has some tags in the form of

<div class="comment form new">

which I want to ignore, and also some tags in the form of

<div class="comment comment-xxxx...">

where the x's represent an integer of arbitrary length, and the ellipsis represents an arbitrary number of other values separated by whitespace (that I'm not concerned about). I can't figure out the correct regular expression, especially since I've never used Python's re module.

Using

soup.find_all(class_="comment")

finds all tags starting with the word comment. I have tried using

soup.find_all(class_=re.compile(r'(comment)( )(comment)'))
soup.find_all(class_=re.compile(r'comment comment.*'))

and lots of other variations, but I think I'm missing something obvious here about how regular expressions or match() work. Can anyone help me out?

abarnert · Accepted Answer

I think I've got it:

>>> [div['class'] for div in soup.find_all('div')]
[['comment', 'form', 'new'], ['comment', 'comment-xxxx...']]

Notice that, unlike the equivalent in BS3, it's not this:

['comment form new', 'comment comment-xxxx...']

And that's why your regexps won't match.
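
To make that concrete, here is a rough sketch using only the re module, with a plain list standing in for what div['class'] gives you. The class string is made up for illustration, and this only imitates the token-by-token matching described above; it is not BS4's actual code, and the exact behavior may differ between BS4 versions.

```python
import re

# Hypothetical class attribute value, standing in for 'comment comment-xxxx...'
class_attr = "comment comment-12345 odd"

# BS4 exposes a multi-valued attribute such as class as a list of tokens
tokens = class_attr.split()

# One of the patterns from the question
pattern = re.compile(r'comment comment.*')

# The pattern contains a space, so it cannot match any single token
per_token = [bool(pattern.search(tok)) for tok in tokens]

# It only matches the whole, unsplit string (the BS3-style value)
whole_string = bool(pattern.search(class_attr))

print(tokens)
print(per_token)
print(whole_string)
```

None of the individual tokens contains a space, so a pattern that spans two class names has nothing to match against once the value is split.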

But you can match, e.g., this:

>>> soup.find_all('div', class_=re.compile('comment-'))
[<div class="comment comment-xxxx..."></div>]

Note that BS does the equivalent of re.search, not re.match, so you don't need 'comment-.*'. Of course, if you want to match 'comment-12345' but not 'comment-of-another-kind', you'd want, e.g., r'comment-\d+'.
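
If the search/match difference is unfamiliar, this small sketch with plain re shows it on a few made-up class tokens (the token list is hypothetical, not taken from the page in the question):

```python
import re

# Made-up class tokens for illustration
tokens = ['comment', 'comment-12345', 'comment-of-another-kind', 'new-comment-7']

loose = re.compile(r'comment-')
strict = re.compile(r'comment-\d+')

# search() finds the pattern anywhere in the token
loose_hits = [t for t in tokens if loose.search(t)]
strict_hits = [t for t in tokens if strict.search(t)]

# match() only succeeds if the pattern matches at the start of the token
anchored_hits = [t for t in tokens if strict.match(t)]

print(loose_hits)
print(strict_hits)
print(anchored_hits)
```

Because search() is unanchored, 'new-comment-7' still gets through the stricter pattern; if that matters for your page, anchoring with r'^comment-\d+$' is one way to rule it out.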
