Jeremy

Reputation: 2536

Python Regex not catching pattern

I am scraping data from a particular page. I have this code:

regex = '<ul class="w462">(.*?)</ul>'

opener.open(baseurl)
urllib2.install_opener(opener)

... rest of code omitted ...

requestData = urllib2.urlopen(request)
htmlText = requestData.read()

pattern = re.compile(regex)
movies = re.findall(pattern, htmlText)

# movies always comes back empty, so the first branch always runs.
if not movies:
    print "List is empty. Printing source instead...", "\n\n"
    print htmlText
else:
    print movies

Content of htmlText:

<ul class="w462">

... a bunch of <li>s (the content I want to retrieve) ...

</ul>

htmlText contains the correct source (I searched it with Ctrl+F and can verify that it contains the desired ul element). It is just that my regex is unable to get the desired content.

I have tried to use this instead:

movies = re.findall(r'<ul class="w462">(.*?)</ul>', htmlText)

Does anyone know what went wrong?

Upvotes: 0

Views: 119

Answers (1)

Tim Peters

Reputation: 70705

By default, . in a regexp matches any character except a newline. So your regexp can't match anything that spans more than one line (that is, anything containing at least one newline), and your ul element spans several lines.

Change the compilation line to:

pattern = re.compile(regex, re.DOTALL)

This changes the meaning of the dot: with re.DOTALL, . matches any character, including a newline.
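
To see the difference, here is a minimal sketch. The html_text string is a made-up stand-in for the page source (the real htmlText is not shown in the question), and the li contents are placeholders. It runs the same pattern with and without re.DOTALL, and also with the inline (?s) flag, which has the same effect as re.DOTALL.

```python
import re

# Hypothetical stand-in for htmlText; the real page content is not shown in the question.
html_text = '<ul class="w462">\n<li>First</li>\n<li>Second</li>\n</ul>'

regex = r'<ul class="w462">(.*?)</ul>'

# Without DOTALL, "." stops at the first newline, so nothing matches.
without_dotall = re.findall(regex, html_text)

# With DOTALL, "." also matches newlines, so the whole list body is captured.
with_dotall = re.findall(regex, html_text, re.DOTALL)

# The inline flag (?s) at the start of the pattern is equivalent to re.DOTALL.
with_inline_flag = re.findall(r'(?s)<ul class="w462">(.*?)</ul>', html_text)

# Each <li> sits on one line here, so the default "." is enough to pull the items out.
items = re.findall(r'<li>(.*?)</li>', with_dotall[0])

print(without_dotall)
print(with_dotall)
print(items)
```

If an individual li can itself span several lines on the real page, the second findall would need re.DOTALL as well.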

Upvotes: 2
