Allen

Reputation: 427

Python Web Scraping Problems

I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why it is not working. Here is my code:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

The original page source looks like this:

<span id="yfs_l84_aapl" class>112.31</span>

Here I just want the price, 112.31. When I copy and paste the element, 'class' changes to 'class=""'. I also tried this code:

regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'

But that does not work either.

Upvotes: 9

Views: 800

Answers (3)

galaxyan

Reputation: 6111

I use BeautifulSoup to get the text from the span tag:

import urllib
from BeautifulSoup import BeautifulSoup

response =urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all spans whose id is 'yfs_l84_aapl'
target = soup.findAll('span',{'id':"yfs_l84_aapl"})
# target is a list, so take its first element
print(target[0].string)
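
The same idea (select the span by its id instead of matching the raw text) can be tried offline. This is only a sketch: it swaps in the standard library's html.parser as a stand-in for BeautifulSoup, uses Python 3 syntax, and runs on a hard-coded copy of the span from the question rather than the live page. The class name SpanTextParser and the sample_html string are made up for illustration.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collects the text of every <span> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False  # True while we are inside a matching span
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a bare "class" has value None
        if tag == "span" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts.append(data)


# Assumed markup, copied from the question (note the valueless class attribute).
sample_html = '<div><span id="yfs_l84_aapl" class>112.31</span></div>'

parser = SpanTextParser("yfs_l84_aapl")
parser.feed(sample_html)
parser.close()
print(parser.texts)  # ['112.31']
```

Because the lookup is keyed on the id attribute, it does not matter whether the span carries class, class="" or no class at all.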

Upvotes: 1

Pyrogrammer

Reputation: 183

When I went to the Yahoo page you linked, I saw a span tag without a class attribute:

<span id="yfs_l84_aapl">112.31</span>

I'm not sure what you are trying to do with "class". Without it, I get 112.31:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
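
The difference between the two patterns can be checked without hitting the network. A minimal sketch, assuming the served markup is the span quoted above; the sample_html string and the variable names are illustrative, and the syntax is Python 3:

```python
import re

# Assumed markup, copied from the span quoted above (no class attribute).
sample_html = '<html><body><span id="yfs_l84_aapl">112.31</span></body></html>'

# Pattern from the question: requires a literal class="" attribute.
with_class = re.compile(r'<span id="yfs_l84_aapl" class="">(.+?)</span>')
# Pattern from this answer: no class attribute.
without_class = re.compile(r'<span id="yfs_l84_aapl">(.+?)</span>')

matches_with_class = with_class.findall(sample_html)
matches_without_class = without_class.findall(sample_html)

print(matches_with_class)     # []
print(matches_without_class)  # ['112.31']
```

The first pattern returns an empty list because the text class="" never appears in the markup, which would explain the [] in the question.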

Upvotes: 2

Shawn Mehan

Reputation: 4568

Well, the good news is that you are getting the data. You were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.

Anyway, here is your working regex:

regex='<span id="yfs_l84_aapl">(\d*\.\d\d)'

You are collecting only digits, so don't use a general catch-all; be specific where you can. This pattern matches any number of digits, then a literal decimal point, then exactly two more digits.
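
To see the pattern at work, here is a small sketch run against a hard-coded copy of the span (an assumption; the live page may differ). It uses Python 3 syntax and a raw string so the backslashes reach the regex engine untouched; sample_html and price_pattern are illustrative names.

```python
import re

# Assumed markup: a hard-coded copy of the span, not fetched from Yahoo.
sample_html = '<span id="yfs_l84_aapl">112.31</span>'

# \d*  : zero or more digits
# \.   : a literal decimal point
# \d\d : exactly two digits
price_pattern = re.compile(r'<span id="yfs_l84_aapl">(\d*\.\d\d)')

prices = price_pattern.findall(sample_html)
print(prices)  # ['112.31']

# Convert the first match to a number, guarding against an empty result.
price = float(prices[0]) if prices else None
print(price)  # 112.31
```

findall always returns a list of strings, so the conversion to float is a separate step, and the empty-list check avoids an IndexError when the page layout changes.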

Upvotes: 5
