facetiousfactorial

Reputation: 153

Regular Expressions Python 3.4

I'm currently starting a web scraper, and it's been a while since I've used Python. I'm sure my code is messy too. Oh well.

def retrieveHTML():
    import re
    import urllib.request
    from urllib.request import urlopen

    urls = ["http://finance.yahoo.com/q?s=^dji", "http://finance.yahoo.com/q?s=^gspc"]
    i = 0
    while i < len(urls):

        htmlfile = urllib.request.urlopen(urls[i])
        htmltext = htmlfile.read()

        if (i == 0):
            regex = b'<span id="yfs_110_^dji">(.+?)</span>'
        if (i == 1):
            regex = b'<span id="yfs_110_^gspc">(.+?)</span>'

        pattern = re.compile(regex)
        price = pattern.match(htmltext)
        print (price)
        i += 1

retrieveHTML()

The regular expression is intended to find the price of the stock, but the match returns None. The HTML used in the regex comes from inspecting the element for the large price at the top of the page, in case there is any ambiguity about that.

Upvotes: 0

Views: 697

Answers (2)

Gang Liang

Reputation: 873

I know this is off topic. :)

I would suggest that the OP use XPath via the xml package instead. I scrape websites like Yahoo as well, and the xml package has saved me a lot of time and energy. Doing everything through regex is a pain in the neck.
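
For what it's worth, here is a rough sketch of what an XPath lookup could look like. It uses Python's standard-library xml.etree.ElementTree as a stand-in, which is an assumption, since the answer does not say which xml package is meant. The page fragment and the price in it are made up for illustration, and ElementTree only accepts well-formed markup, so a real page would likely need a more lenient HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for the downloaded page.
# Real pages are often not valid XML and may need a lenient HTML parser.
page = '<html><body><span id="yfs_110_^dji">17,000.00</span></body></html>'

root = ET.fromstring(page)

# ElementTree supports a limited XPath subset, including
# [@attribute='value'] predicates. The ^ needs no escaping here
# because it has no special meaning inside the quoted value.
span = root.find(".//span[@id='yfs_110_^dji']")
price = span.text if span is not None else None
print(price)  # 17,000.00
```

The lookup is by element id rather than by the surrounding text, so it does not depend on the exact bytes around the span.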

Upvotes: 2

jwodder

Reputation: 57460

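The character ^ has special meaning in a regular expression: it anchors the match to the start of the string (or of each line with re.MULTILINE), which is not what you want here. To match a literal ^ instead, you have to escape it, as in the patterns below. Also note that pattern.match only tries to match at the very start of htmltext, so use pattern.search to find the span anywhere in the page.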

if (i == 0):
    regex = b'<span id="yfs_110_\\^dji">(.+?)</span>'
if (i == 1):
    regex = b'<span id="yfs_110_\\^gspc">(.+?)</span>'
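
As a minimal sketch of the escaping in action, the snippet below runs against a made-up HTML fragment rather than the live page. The helper find_price and the sample price are hypothetical names and data introduced for illustration. It uses re.escape to do the escaping instead of writing the backslashes by hand.

```python
import re

# Hypothetical fragment standing in for the bytes returned by read().
html = b'<div><span id="yfs_110_^dji">17,000.00</span></div>'

def find_price(html, symbol):
    """Return the contents of the span whose id ends in `symbol`, or None."""
    # re.escape backslash-escapes regex metacharacters such as ^.
    pattern = re.compile(
        b'<span id="yfs_110_' + re.escape(symbol) + b'">(.+?)</span>'
    )
    # search() scans the whole input; match() only tries at position 0.
    found = pattern.search(html)
    return found.group(1) if found else None

# Unescaped, ^ asserts "start of string" in the middle of the pattern,
# so the pattern can never match.
unescaped = re.compile(b'<span id="yfs_110_^dji">(.+?)</span>')
print(unescaped.search(html))      # None
print(find_price(html, b'^dji'))   # b'17,000.00'
```

Because the page is read as bytes, the pattern and the symbol are bytes too; mixing str and bytes in re raises a TypeError.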

Upvotes: 1
