Reputation: 128
This is the code I am using from Christopher Reeves' tutorial on stock scraping; it's his third video on the subject on YouTube.
import urllib
import re

symbolslist = ["aapl", "spy", "goog", "nflx"]
i = 0
while i < len(symbolslist):
    url = "http://finance.yahoo.com/q?s=" + symbolslist[i] + "&q1=1"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_' + symbolslist[i] + '">(.?+)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print "The price of", symbolslist[i], " is", price
    i += 1
I get the following error when I run this code in Python 2.7.5:
Traceback (most recent call last):
  File "fundamentalism)stocks.py", line 12, in <module>
    pattern = re.compile(regex)
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat
I don't know if the problem is with the way my library is installed, my version of Python, or something else. I appreciate your help.
Upvotes: 1
Views: 621
Reputation: 30947
Others have answered about the greedy match, but on an unrelated note you'll want to write that more like:
for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&q1=1" % symbol
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    # (.+?) is the non-greedy group; the original (.?+) raises "multiple repeat"
    regex = '<span id="yfs_l84_%s">(.+?)</span>' % symbol
    price = re.findall(regex, htmltext)[0]
    print "The price of", symbol, "is", price
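To try the loop above without hitting the network, here is a minimal sketch written for Python 3 (so print is a function). The SAMPLE_PAGES dictionary and the extract_price helper are hypothetical names made up for illustration, and the HTML fragments are invented stand-ins; the real Yahoo markup may differ.

```python
import re

# Hypothetical HTML fragments standing in for the downloaded pages;
# the real markup served by Yahoo may look different.
SAMPLE_PAGES = {
    "aapl": '<div><span id="yfs_l84_aapl">501.53</span></div>',
    "spy": '<div><span id="yfs_l84_spy">170.95</span></div>',
}


def extract_price(symbol, htmltext):
    """Return the first captured price for symbol, or None if nothing matches."""
    # Non-greedy group, passed to re.findall as a plain string.
    regex = '<span id="yfs_l84_%s">(.+?)</span>' % symbol
    matches = re.findall(regex, htmltext)  # list of captured groups
    return matches[0] if matches else None


for symbol in SAMPLE_PAGES:
    print("The price of", symbol, "is", extract_price(symbol, SAMPLE_PAGES[symbol]))
```

Returning None instead of indexing with [0] directly is a design choice of this sketch: it avoids an IndexError when the span is missing from the page.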
q1 in a later version).
re.findall takes a string as its first argument. Explicitly compiling a pattern and then throwing it away in the next loop doesn't get you anything.
re.findall returns a list, and you only want the first element from it.
Upvotes: 0
Reputation: 473983
The problem is the use of multiple repeat characters in a row: + and ?.
Non-greedy matching was probably meant instead: (.+?). From the Python re documentation:
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
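The difference is easy to check in an interpreter. The sketch below is written for Python 3, and try_compile is a hypothetical helper added only for illustration. One caveat: on Python 2.7 and on Python 3 releases before 3.11, (.?+) fails to compile with "multiple repeat", while Python 3.11 and later parse ?+ as a possessive quantifier, so the result of the last line depends on the interpreter version.

```python
import re

text = "<H1>title</H1>"

greedy = re.match(r"<.*>", text).group(0)  # consumes as much text as possible
lazy = re.match(r"<.*?>", text).group(0)   # stops at the first '>'

print(greedy)
print(lazy)


def try_compile(pattern):
    """Return None if the pattern compiles, otherwise the error message."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)


print(try_compile("(.+?)"))  # non-greedy group, compiles on every version
print(try_compile("(.?+)"))  # version dependent, see the note above
```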
Upvotes: 3