Reputation: 153
I'm currently starting a web scraper, and it's been a while since I've used Python. I'm sure my code is messy too. Oh well.
def retrieveHTML():
    import re
    import urllib.request
    from urllib.request import urlopen
    urls = ["http://finance.yahoo.com/q?s=^dji", "http://finance.yahoo.com/q?s=^gspc"]
    i = 0
    while i < len(urls):
        htmlfile = urllib.request.urlopen(urls[i])
        htmltext = htmlfile.read()
        if (i == 0):
            regex = b'<span id="yfs_110_^dji">(.+?)</span>'
        if (i == 1):
            regex = b'<span id="yfs_110_^gspc">(.+?)</span>'
        pattern = re.compile(regex)
        price = pattern.match(htmltext)
        print (price)
        i += 1
retrieveHTML()
The regular expression is intended to find the price of the stock, but it returns "None". In case there is any ambiguity: the bit of HTML used in the regex comes from inspecting the element of the large price at the top of the page.
Upvotes: 0
Views: 697
Reputation: 873
I know this is off topic :).
I would suggest that the OP use XPath from the xml package. I scrape websites like Yahoo as well, and the xml package saved me a lot of time and energy. Doing everything through regex is a pain in the neck.
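To give a rough idea of what the XPath route might look like, here is a minimal sketch. It assumes Python's standard-library xml.etree.ElementTree (the post does not say which "xml package" is meant) and a made-up, well-formed snippet standing in for the page. Real pages are rarely valid XML, so a lenient HTML parser would probably be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page; the real markup may differ.
SAMPLE_PAGE = '<div><span id="yfs_110_^dji">17,000.00</span></div>'

def price_by_id(markup, span_id):
    """Return the text of the <span> with the given id, or None if absent."""
    root = ET.fromstring(markup)
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    # The '^' needs no escaping here because this is not a regular expression.
    node = root.find(".//span[@id='" + span_id + "']")
    return node.text if node is not None else None

print(price_by_id(SAMPLE_PAGE, "yfs_110_^dji"))
```

The attribute lookup treats the id as a plain string, so characters that are special in a regex stop being a concern.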
Upvotes: 2
Reputation: 57460
The character ^ has special meaning in a regular expression: it matches the beginning of the string (or of each line in multiline mode), which appears not to be what you want here. To match the actual character ^ instead, you have to escape it:
if (i == 0):
    regex = b'<span id="yfs_110_\\^dji">(.+?)</span>'
if (i == 1):
    regex = b'<span id="yfs_110_\\^gspc">(.+?)</span>'
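As a self-contained check of the escaping, the sketch below runs against a made-up byte string rather than the live page (SAMPLE_HTML and find_price are hypothetical names, and the real markup may differ). It also swaps match() for search(), which is an addition not covered above: match() only tries the pattern at the very start of the input, so on a full page it would likely still return None even with the ^ escaped.

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = b'<html><body><span id="yfs_110_^dji">17,000.00</span></body></html>'

def find_price(html, symbol):
    """Return the bytes inside the span for `symbol`, or None if absent."""
    # re.escape backslash-escapes the '^' so it is matched literally.
    pattern = re.compile(b'<span id="yfs_110_' + re.escape(symbol) + b'">(.+?)</span>')
    # search() scans the whole input; match() would only try position 0.
    found = pattern.search(html)
    return found.group(1) if found else None

# Unescaped '^' acts as an anchor in the middle of the pattern, so nothing matches.
unescaped = re.compile(b'<span id="yfs_110_^dji">(.+?)</span>')
print(unescaped.search(SAMPLE_HTML))      # None
print(find_price(SAMPLE_HTML, b"^dji"))   # b'17,000.00'
```

Building the pattern with re.escape avoids hand-writing the backslashes for each symbol, which may be convenient if the list of URLs grows.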
Upvotes: 1