Rob B.

Reputation: 128

Python Regex Compile 2.7.5

This is the code I am using from Christopher Reeves' tutorial on stock scraping; it's his third video on the subject on YouTube.

import urllib
import re

symbolslist = ["aapl","spy","goog","nflx"]

i=0
while i<len(symbolslist):
    url = "http://finance.yahoo.com/q?s=" +symbolslist[i] +"&q1=1"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_'+symbolslist[i] +'">(.?+)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "The price of", symbolslist[i]," is", price
    i+=1

I get the following error when I run this code in Python 2.7.5:

Traceback (most recent call last):
  File "fundamentalism)stocks.py", line 12, in <module>
    pattern = re.compile(regex)
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

I don't know if the problem is with the way my library is installed, my version of Python, or something else. I appreciate your help.

Upvotes: 1

Views: 621

Answers (2)

Kirk Strauser

Reputation: 30947

Others have answered about the greedy match, but on an unrelated note you'll want to write that more like:

for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&q1=1" % symbol
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_%s">(.+?)</span>' % symbol
    price = re.findall(regex, htmltext)[0]
    print "The price of", symbol," is", price
  • The standard Python idiom is to iterate across all the values in a list, not to pick them out by index.
  • "String interpolation" is a lot easier to manage than string concatenation, especially if you're adding several values into the mix (like maybe you want to specify the value of q1 in a later version).
  • re.findall takes a string as its first argument. Explicitly compiling a pattern and then throwing it away in the next loop doesn't get you anything.
  • re.findall returns a list, and you only want the first element from it.
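One caveat with taking element [0]: if the page contains no match, re.findall returns an empty list and the index raises IndexError. Here is a minimal offline sketch of a guard for that, assuming the non-greedy group (.+?). The helper name extract_price and the sample HTML string are made up for illustration; only the span id format comes from the question.

```python
import re

def extract_price(symbol, htmltext):
    """Return the first captured price for symbol, or None if nothing matches."""
    # re.escape is a precaution in case a symbol ever contains regex metacharacters.
    regex = '<span id="yfs_l84_%s">(.+?)</span>' % re.escape(symbol)
    matches = re.findall(regex, htmltext)  # always a list, possibly empty
    return matches[0] if matches else None

# Hypothetical page fragment, not real Yahoo output.
sample = '<div><span id="yfs_l84_aapl">501.53</span></div>'

print(extract_price("aapl", sample))  # 501.53
print(extract_price("goog", sample))  # None
```

Returning None instead of indexing blindly lets the loop skip or report a symbol whose markup has changed, rather than stopping at the first miss.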

Upvotes: 0

alecxe

Reputation: 473983

The problem is that the pattern stacks two repetition operators, ? followed by +, which is what the "multiple repeat" error refers to.
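A quick way to see which form the regex engine accepts is to compile both inside a try/except. This is only a sketch; the helper name compile_error is made up for illustration. One thing to be aware of: from Python 3.11 onward, ?+ is parsed as a possessive quantifier, so (.?+) compiles there without error (though it still would not do what the question intends). On 2.7.5 it raises "multiple repeat" as in the traceback.

```python
import re

def compile_error(pattern):
    """Return None if pattern compiles, otherwise the error message as a string."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)

# Pattern from the question: its result depends on the Python version (see note above).
print(compile_error('<span id="yfs_l84_aapl">(.?+)</span>'))

# Non-greedy form: compiles on every version.
print(compile_error('<span id="yfs_l84_aapl">(.+?)</span>'))
```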

Most likely, non-greedy matching was intended instead: (.+?). From the Python re documentation:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
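The difference is easy to check in an interpreter. The first pair of calls below reproduces the <H1> example from the quoted passage. The second pair applies the same idea to a span like the one in the question, using a made-up two-span string rather than real page content.

```python
import re

heading = '<H1>title</H1>'

greedy = re.match(r'<.*>', heading).group(0)       # runs to the last '>'
non_greedy = re.match(r'<.*?>', heading).group(0)  # stops at the first '>'

print(greedy)      # <H1>title</H1>
print(non_greedy)  # <H1>

# Hypothetical markup with two spans on one line.
spans = '<span id="x">1</span><span>2</span>'

greedy_spans = re.findall(r'<span id="x">(.+)</span>', spans)
non_greedy_spans = re.findall(r'<span id="x">(.+?)</span>', spans)

print(greedy_spans)      # ['1</span><span>2']
print(non_greedy_spans)  # ['1']
```

With several spans on one line, the greedy group swallows everything up to the last closing tag, which is why (.+?) is the safer choice for pulling out a single price.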

Upvotes: 3
