buck54321

Reputation: 847

Beautiful Soup get_text() output in regular expression

I am trying to extract some info from a webpage. I am using Beautiful Soup's get_text method to get the text, but when I try to pass that text through a regular expression, nothing is being returned.

import urllib2
from bs4 import BeautifulSoup
import re

url = "http://www.somesite.com"
page = BeautifulSoup(urllib2.urlopen(url))
info = {}
info['description'] = page.get_text()
print info['description'] #this works fine
print re.match(r'.',info['description'],re.UNICODE).group()

Returns None.

Upvotes: 0

Views: 816

Answers (1)

senshin

Reputation: 10360

Okay, here's probably what's going on (I haven't verified this, since I don't have Python 2 on my machine and can't reproduce it in Python 3). The docs for re.match read:

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Important point: re.match only matches at the beginning of a string.

Next, the dot character .:

'.'

(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

So, . doesn't match newlines. Therein lies the problem - if info['description'] begins with a newline, you will not get a match.
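To make that concrete, here is a minimal sketch of the failure. The string `text` is a made-up stand-in for whatever `page.get_text()` returns (I'm assuming it starts with a newline, which I haven't confirmed for your page), and it is written so that it also runs on Python 3.

```python
import re

# Hypothetical stand-in for page.get_text() output; the leading
# newline is an assumption about what the real page produces.
text = "\nSome page text"

# re.match only tries position 0, and '.' does not match '\n' by default,
# so there is no match and the result is None.
m = re.match(r'.', text, re.UNICODE)
print(m)

# Calling .group() directly on None would raise AttributeError,
# so check the result before using it.
first_char = m.group() if m is not None else None
```

Checking for `None` before calling `.group()` is worth doing in general, since any failed match returns `None` rather than an empty match object.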

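What you should do is either use re.search, which scans forward to the first position where the pattern matches, or pass the re.DOTALL flag to re.match so that . matches the newline as well.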
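A rough sketch of both options, again using a made-up string with a leading newline in place of the real `page.get_text()` output (that leading newline is my assumption, not something I've checked against your page):

```python
import re

# Hypothetical stand-in for page.get_text() output.
text = "\nSome page text"

# Option 1: re.search scans through the string, so it skips the
# leading newline and matches the first character '.' can match.
found = re.search(r'.', text, re.UNICODE)
print(repr(found.group()))   # 'S'

# Option 2: re.DOTALL makes '.' match newlines too, so re.match
# succeeds at position 0 and returns the newline itself.
dotall = re.match(r'.', text, re.UNICODE | re.DOTALL)
print(repr(dotall.group()))  # '\n'
```

Note that the two are not interchangeable: with re.search you get the first non-newline character, while with re.DOTALL you get the newline. Which one you want depends on what your real pattern is supposed to capture.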

Upvotes: 2
