Reputation: 2536
I am scraping data from a particular page. I have this code:
regex = '<ul class="w462">(.*?)</ul>'
opener.open(baseurl)
urllib2.install_opener(opener)
... rest of code omitted ...
requestData = urllib2.urlopen(request)
htmlText = requestData.read()
pattern = re.compile(regex)
movies = re.findall(pattern, htmlText)
# The check below always finds an empty list.
if not movies:
    print "List is empty. Printing source instead...", "\n\n"
    print htmlText
else:
    print movies
Content of htmlText:
<ul class="w462">
... bunch of <li>s (the content i want to retrieve).
</ul>
htmlText contains the correct source (I searched it with Ctrl+F and can verify that it contains the desired ul element). It is just that my regex is unable to get the desired content.
I have tried to use this instead:
movies = re.findall(r'<ul class="w462">(.*?)</ul>', htmlText)
Does anyone know what went wrong?
Upvotes: 0
Views: 119
Reputation: 70705
By default, the dot (.) in a regexp matches any character except a newline. So your regexp can't match anything that spans more than one line (that is, anything that contains at least one newline).
Change the compilation line to:
pattern = re.compile(regex, re.DOTALL)
to change the meaning of the dot. With re.DOTALL, the dot (.) will match any character (including newline).
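To see the difference on a small input, here is a minimal sketch. The html_text string is a made-up stand-in for the real page source (the actual htmlText comes from the request in the question), and the names without_dotall and with_dotall are only for illustration:

```python
import re

# Hypothetical stand-in for the page source; the real htmlText comes from the request.
html_text = '<ul class="w462">\n<li>First</li>\n<li>Second</li>\n</ul>'

regex = '<ul class="w462">(.*?)</ul>'

# Without re.DOTALL, "." cannot cross the newline after the opening tag,
# so the pattern never reaches </ul> and nothing matches.
without_dotall = re.findall(regex, html_text)

# With re.DOTALL, "." also matches newlines, so the whole list body is captured.
with_dotall = re.compile(regex, re.DOTALL).findall(html_text)

print(without_dotall)
print(with_dotall)
```

The first result is an empty list and the second holds the text between the ul tags, newlines included. re.S is a shorter alias for re.DOTALL.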
Upvotes: 2