user3739620

Reputation: 85

Python regex: Difference between (.+) and (.+?)

I am new to regex and Python's urllib. I went through an online tutorial on web scraping, and it had the following code. After studying up on regular expressions, it seemed to me that I could use (.+) instead of (.+?) in my regex, but whoa, was I wrong. I ended up printing way more HTML than I wanted. I thought I was getting the hang of regex, but now I am confused. Please explain the difference between these two expressions and why (.+) grabs so much HTML. Thanks!

P.S. This is a Starbucks stock quote scraper.

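import re
import urllib.request

# Fetch the quote page and decode the bytes so the str pattern can search it
response = urllib.request.urlopen("http://finance.yahoo.com/q?s=SBUX")
htmltext = response.read().decode("utf-8", errors="replace")

# (.+?) is lazy: it captures only up to the first closing </span>
regex = re.compile(r'<span id="yfs_l84_sbux">(.+?)</span>')
found = regex.findall(htmltext)

print(found)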

Upvotes: 8

Views: 23070

Answers (3)

elixenide

Reputation: 44831

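.+ is greedy -- it matches as many characters as it can, then gives back only as many as the rest of the pattern needs in order to match.

.+? is not -- it matches as few characters as it can and stops at the first point where the rest of the pattern matches.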

Examples:

Assume you have this HTML:

<span id="yfs_l84_sbux">foo bar</span><span id="yfs_l84_sbux2">foo bar</span>

This regex matches the whole thing:

<span id="yfs_l84_sbux">(.+)<\/span>

The (.+) first runs all the way to the end of the string, then "gives back" characters one at a time until the rest of the regex can match. The first place that works is the last </span>, so the complete regex matches the entire HTML chunk.

But this regex stops at the first </span>:

<span id="yfs_l84_sbux">(.+?)<\/span>
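To see this for yourself, here is a minimal Python 3 sketch run against the sample HTML above. The variable names (html, greedy, lazy) are just my own picks for illustration, and I left out the backslash before the slash since Python's re does not need it.

```python
import re

# Sample HTML from the example above: two adjacent spans
html = ('<span id="yfs_l84_sbux">foo bar</span>'
        '<span id="yfs_l84_sbux2">foo bar</span>')

# Greedy: (.+) runs to the end, then backs up to the LAST </span>
greedy = re.findall(r'<span id="yfs_l84_sbux">(.+)</span>', html)

# Lazy: (.+?) grows one character at a time and stops at the FIRST </span>
lazy = re.findall(r'<span id="yfs_l84_sbux">(.+?)</span>', html)

print(greedy)  # ['foo bar</span><span id="yfs_l84_sbux2">foo bar']
print(lazy)    # ['foo bar']
```

The greedy capture swallows the first closing tag and the whole second opening tag, which is the "way more HTML" effect from the question.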

Upvotes: 11

Unihedron

Reputation: 11041

(.+) is greedy. It takes what it can and gives back when needed.

(.+?) is ungreedy (also called lazy). It takes as few characters as possible.

See (the brackets mark the part of the string each regex matches):

delegate

[delegate] /^(.+)e/
[de]legate /^(.+?)e/

Also, comparing the "Regex debugger log" here and here will show you what the ungreedy modifier does more effectively.
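If you want to check the delegate example in Python 3, something like this sketch should do. The names word, greedy_match and lazy_match are only illustrative.

```python
import re

word = "delegate"

# Greedy: (.+) takes everything, then backs off until a final "e" can match
greedy_match = re.match(r'^(.+)e', word)

# Lazy: (.+?) takes one character, and the very next "e" already satisfies the regex
lazy_match = re.match(r'^(.+?)e', word)

print(greedy_match.group(0), greedy_match.group(1))  # delegate delegat
print(lazy_match.group(0), lazy_match.group(1))      # de d
```

group(0) is the whole match (the bracketed part above) and group(1) is what the parentheses captured.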

Upvotes: 1

Amadan

Reputation: 198324

When it follows a repetition operator, ? is a non-greedy modifier. * by default is a greedy repetition operator - it will gobble up everything it can; when modified by ? it becomes non-greedy and will eat up only as much as will satisfy it. The same goes for + and +?, as in your (.+) and (.+?).

Thus for

<span id="yfs_l84_sbux">want</span>text<span id="somethingelse">dontwant</span>

with the opening tag already matched, .*?</span> will eat up want, then hit </span> - and this satisfies the regexp with minimal repetitions of ., resulting in <span id="yfs_l84_sbux">want</span> being the match. However, .* will try to see if it can eat more - it will go on and find the other </span>, with .* matching want</span>text<span id="somethingelse">dontwant, resulting in what you got - much more than you wanted.
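As a quick check, here is a small Python 3 sketch of the same input using re.search. The variable names (text, lazy_hit, greedy_hit) are made up for the illustration.

```python
import re

text = ('<span id="yfs_l84_sbux">want</span>text'
        '<span id="somethingelse">dontwant</span>')

# Non-greedy: .*? stops as soon as the following </span> can match
lazy_hit = re.search(r'<span id="yfs_l84_sbux">(.*?)</span>', text)

# Greedy: .* keeps going and only backs off to the last </span> in the string
greedy_hit = re.search(r'<span id="yfs_l84_sbux">(.*)</span>', text)

print(lazy_hit.group(1))    # want
print(greedy_hit.group(1))  # want</span>text<span id="somethingelse">dontwant
```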

Upvotes: 3
