Reputation: 847
I am trying to extract some info from a webpage. I am using Beautiful Soup's get_text method to get the text, but when I try to pass that text through a regular expression, nothing is being returned.
import urllib2
from bs4 import BeautifulSoup
import re
url = "http://www.somesite.com"
page = BeautifulSoup(urllib2.urlopen(url))
info = {}
info['description'] = page.get_text()
print info['description'] #this works fine
print re.match(r'.',info['description'],re.UNICODE).group()
Returns None.
Upvotes: 0
Views: 816
Reputation: 10360
Okay, here's probably what's going on (but I haven't checked to see if this is actually the case, since I don't have Python 2 on my machine and can't reproduce this in Python 3). If you look at the docs for re.match, you find that it reads:
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Important point: re.match only matches at the beginning of a string.
Next, the docs for the dot character . read:
'.' (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
So, . doesn't match newlines. Therein lies the problem: if info['description'] begins with a newline, you will not get a match.
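Here is a minimal sketch of that failure mode. It is written in Python 3 syntax, and the string is a made-up stand-in for whatever page.get_text() actually returns, so treat it as an illustration of the suspected cause rather than a confirmed reproduction:

```python
import re

# Hypothetical stand-in for page.get_text(); text extracted from HTML
# often starts with one or more newlines.
text = "\n\nSome description"

# re.match only tries position 0, and '.' cannot match the '\n' found there.
m = re.match(r'.', text, re.UNICODE)
print(m)  # None

# Calling m.group() at this point would raise AttributeError,
# because m is None rather than a match object.
```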
What you should do is either use re.search or pass in the re.DOTALL flag to re.match.
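Both options can be sketched as follows, again in Python 3 syntax and with a made-up string standing in for the real page text (the variable names are only for illustration):

```python
import re

# Hypothetical stand-in for page.get_text()
text = "\n\nSome description"

# Option 1: re.search scans forward, so it skips the leading newlines
# and returns the first character that '.' can match.
first_visible = re.search(r'.', text, re.UNICODE)
print(repr(first_visible.group()))  # 'S'

# Option 2: with re.DOTALL, '.' also matches '\n', so re.match
# succeeds at position 0 and returns the newline itself.
first_any = re.match(r'.', text, re.UNICODE | re.DOTALL)
print(repr(first_any.group()))  # '\n'
```

Note that the two options are not equivalent: re.search returns the first non-newline character, while re.match with re.DOTALL returns whatever character comes first, newline included. Which one you want depends on what you are trying to extract.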
Upvotes: 2