Reputation: 85
I am new to regex and Python's urllib. I went through an online tutorial on web scraping that had the following code. After studying up on regular expressions, it seemed to me that I could use (.+) instead of the (.+?) in my regex, but whoa, was I wrong: I ended up printing far more HTML than I wanted. I thought I was getting the hang of regex, but now I am confused. Please explain the difference between these two expressions and why (.+) grabs so much HTML. Thanks!
P.S. This is a Starbucks stock quote scraper.
import urllib
import re
url = urllib.urlopen("http://finance.yahoo.com/q?s=SBUX")
htmltext = url.read()
regex = re.compile('<span id="yfs_l84_sbux">(.+?)</span>')
found = re.findall(regex, htmltext)
print found
Upvotes: 8
Views: 23070
Reputation: 44831
.+ is greedy -- it matches until it can't match any more and gives back only as much as needed.
.+? is not -- it stops at the first opportunity.
Examples:
Assume you have this HTML:
<span id="yfs_l84_sbux">foo bar</span><span id="yfs_l84_sbux2">foo bar</span>
This regex matches the whole thing:
<span id="yfs_l84_sbux">(.+)<\/span>
It goes all the way to the end, then "gives back" one </span>, but the rest of the regex matches that last </span>, so the complete regex matches the entire HTML chunk.
But this regex stops at the first </span>:
<span id="yfs_l84_sbux">(.+?)<\/span>
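To see the two patterns side by side, here is a minimal sketch that runs both against the sample HTML above. It uses a hard-coded string rather than a live page, and the variable names are only illustrative.

```python
import re

# Sample HTML from the example above (hard-coded, no network access).
html = ('<span id="yfs_l84_sbux">foo bar</span>'
        '<span id="yfs_l84_sbux2">foo bar</span>')

# Greedy: (.+) runs to the end of the string, then backtracks
# just far enough for the final </span> to match.
greedy = re.findall(r'<span id="yfs_l84_sbux">(.+)</span>', html)

# Non-greedy: (.+?) stops at the first </span> it can reach.
lazy = re.findall(r'<span id="yfs_l84_sbux">(.+?)</span>', html)

print(greedy)  # ['foo bar</span><span id="yfs_l84_sbux2">foo bar']
print(lazy)    # ['foo bar']
```

The greedy capture swallows the first closing tag and the whole second span, which is the "way more HTML" effect from the question.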
Upvotes: 11
Reputation: 11041
(.+) is greedy. It takes what it can and gives back when needed.
(.+?) is ungreedy. It takes as few characters as possible.
See:
delegate
[delegate] /^(.+)e/
[de]legate /^(.+?)e/
Also, comparing the "Regex debugger log" here and here will show you what the ungreedy modifier does more effectively.
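The "delegate" example can be reproduced in Python with a short sketch like the following; group(0) is the whole match (the bracketed part above) and group(1) is what the parentheses captured.

```python
import re

word = "delegate"

# Greedy: .+ grabs the whole word, then gives back one character
# so that the trailing "e" in the pattern can match.
greedy = re.match(r'^(.+)e', word)

# Ungreedy: .+? takes a single character, then "e" matches right away.
lazy = re.match(r'^(.+?)e', word)

print(greedy.group(0), greedy.group(1))  # delegate delegat
print(lazy.group(0), lazy.group(1))      # de d
```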
Upvotes: 1
Reputation: 198324
? is a non-greedy modifier. * by default is a greedy repetition operator - it will gobble up everything it can; when modified by ? it becomes non-greedy and will eat up only as much as will satisfy it.
Thus for
<span id="yfs_l84_sbux">want</span>text<span id="somethingelse">dontwant</span>
.*?</span> will eat up want, then hit </span> - and this satisfies the regexp with the fewest repetitions of the dot, resulting in <span id="yfs_l84_sbux">want</span> being the match. However, .* will try to see if it can eat more - it will go on and find the other </span>, with .* matching want</span>text<span id="somethingelse">dontwant, resulting in what you got - much more than you wanted.
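As a rough check of the walkthrough above, this sketch applies both forms to the same sample string. The prefix variable is just a convenience introduced here, not part of the original code.

```python
import re

html = ('<span id="yfs_l84_sbux">want</span>text'
        '<span id="somethingelse">dontwant</span>')

# Opening tag shared by both patterns (illustrative helper).
prefix = '<span id="yfs_l84_sbux">'

# Non-greedy: stops at the first </span>.
lazy = re.search(prefix + r'(.*?)</span>', html)

# Greedy: keeps going and ends at the last </span>.
greedy = re.search(prefix + r'(.*)</span>', html)

print(lazy.group(1))    # want
print(greedy.group(1))  # want</span>text<span id="somethingelse">dontwant
```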
Upvotes: 3