Python Regex not catching pattern

Question

I am scraping data from a particular page. I have this code:

regex = '(.*?)'

opener.open(baseurl)
urllib2.install_opener(opener)

... rest of code omitted ...

requestData = urllib2.urlopen(request)
htmlText = requestData.read()

pattern = re.compile(regex)
movies = re.findall(pattern, htmlText)

# The list below always comes back empty.
if not movies:
    print "List is empty. Printing source instead...", "\n\n"
    print htmlText
else:
    print movies

content of htmlText:



... a bunch of elements (the content I want to retrieve).

htmlText contains the correct source (I searched it with Ctrl+F and can verify that it contains the desired ul element). It is just that my regex is unable to get the desired content.

I have tried to use this instead:

movies = re.findall(r'(.*?)', htmlText)

Does anyone know what went wrong?

Tim Peters · Accepted Answer

By default, . in a regexp matches any character except for a newline. So your regexp can't match anything that spans more than one line (that contains at least one newline).

Change the compilation line to:

pattern = re.compile(regex, re.DOTALL)

to change the meaning of .. With re.DOTALL, . will match any character (including newline).
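
A minimal sketch of the difference, assuming a made-up snippet of markup: the `<ul class="movies">` wrapper and the list items below are hypothetical stand-ins, since the actual tags in the question's regex and page source are not shown. The same pattern is run once without and once with re.DOTALL:

```python
import re

# Hypothetical markup standing in for the real page source.
# The content between the opening and closing tags spans several lines.
html_text = '<ul class="movies">\n<li>Alpha</li>\n<li>Beta</li>\n</ul>'

# Hypothetical pattern; the real one from the question is not recoverable.
regex = r'<ul class="movies">(.*?)</ul>'

# Without re.DOTALL, "." stops at the first newline, so nothing matches.
without_dotall = re.findall(regex, html_text)

# With re.DOTALL, "." also matches newlines, so the whole block is captured.
with_dotall = re.findall(regex, html_text, re.DOTALL)

# The inline flag (?s) at the start of the pattern has the same effect.
with_inline_flag = re.findall('(?s)' + regex, html_text)

print(without_dotall)  # []
print(with_dotall)     # ['\n<li>Alpha</li>\n<li>Beta</li>\n']
```

The captured group still contains the raw markup, newlines included, so it would need a second pass to pull out the individual items.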
