Reputation: 183
I'm trying to extract the value of "Next 5 Years (per annum)" for the stock BABA from the Yahoo Finance "Analysis" tab: https://finance.yahoo.com/quote/BABA/analysis?p=BABA. (It's 2.85%, in the second row from the bottom.)
I have tried following the answers to these questions:
Scrape Yahoo Finance Financial Ratios
Scrape Yahoo Finance Income Statement with Python
But I can't even extract the data from the page.
I tried this website as well:
https://hackernoon.com/scraping-yahoo-finance-data-using-python-ayu3zyl
This is the code I wrote to get the web page data.
First, import the packages:
import requests
from bs4 import BeautifulSoup
Then try to extract the data from the page:
Url= "https://finance.yahoo.com/quote/BABA/analysis?p=BABA"
r = requests.get(Url)
data = r.text
soup = BeautifulSoup(data,features="lxml")
When looking at the types of the "data" and "soup" objects, I see that
type(data)
<class 'str'>
From this string I can somehow extract the needed row (">Next 5 Years") using regular expressions.
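A minimal sketch of that regular-expression approach, run against a made-up HTML fragment rather than the live page. The `sample_html` string and the `extract_growth` name are illustrative, and the sketch assumes the value is plain text in the first `<td>` after the label; the real markup may differ.

```python
import re

# Hypothetical fragment imitating one table row; the real page markup may differ.
sample_html = (
    '<tr><td><span>Next 5 Years (per annum)</span></td>'
    '<td class="val">2.85%</td><td>N/A</td></tr>'
)

def extract_growth(html, label="Next 5 Years (per annum)"):
    """Return the text of the first plain-text <td> after the label, or None."""
    # Lazily skip ahead from the label to the next <td ...>text</td> cell.
    pattern = re.escape(label) + r".*?<td[^>]*>\s*([^<]*?)\s*</td>"
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1) if match else None

print(extract_growth(sample_html))
```

A regex like this is brittle: if the value were wrapped in another tag inside the cell, the pattern would skip that cell and pick up a later one.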
But when looking at
type(soup)
<class 'bs4.BeautifulSoup'>
And for some reason the data in it does not seem relevant to the page.
It looks like this (I copied only a small part of what is in the soup object):
soup
<!DOCTYPE html>
<html class="NoJs featurephone" id="atomic" lang="en-US"><head prefix="og:
http://ogp.me/ns#"><script>window.performance && window.performance.mark &&
window.performance.mark('PageStart');</script><meta charset="utf-8"/>
<title>Alibaba Group Holding Limited (BABA) Analyst Ratings, Estimates &
Forecasts - Yahoo Finance</title><meta content="recommendation,analyst,analyst
rating,strong buy,strong
sell,hold,buy,sell,overweight,underweight,upgrade,downgrade,price target,EPS
estimate,revenue estimate,growth estimate,p/e
estimate,recommendation,analyst,analyst rating,strong buy,strong
sell,hold,buy,sell,overweight,underweight,upgrade,downgrade,price target,EPS
estimate,revenue estimate,growth estimate,p/e estimate" name="keywords"/>
<meta content="on" http-equiv="x-dns-prefetch-control"/><meta content="on"
property="twitter:dnt"/><meta content="90376669494" property="fb:app_id"/>
<meta content="#400090" name="theme-color"/><meta content="width=device-width,
Thanks in advance.
Upvotes: 1
Views: 1873
Reputation: 944
Here's what I have. The issue I run into is a request limit: after a certain number of requests I'm no longer able to get the information.
import os

import requests
from bs4 import BeautifulSoup

def yahoo_growth_soup(ticker, debug_mode=False):
    """
    Returns the growth estimate for a ticker from Yahoo Finance.
    """
    # Set up headers to avoid getting blocked by Yahoo Finance
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}
    # url = f"https://finance.yahoo.com/quote/{ticker}/analysis?p={ticker}"
    url = f"https://finance.yahoo.com/quote/{ticker}/analysis"
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html5lib")
    if debug_mode:
        # Dump the parsed page to disk so it can be inspected by hand
        output_file_path = r'E:\Finance\Python Scripts\debug'
        output_file_name = 'soup_output.txt'
        with open(os.path.join(output_file_path, output_file_name), "w", encoding="utf-8") as output_file:
            output_file.write(str(soup))
    # Find the label cell; the value is in the next cell of the same row
    value_element = soup.find("td", string="Next 5 Years (per annum)")
    if value_element:
        value = value_element.find_next_sibling("td").text
        if value == '--':
            return 0.0
        growth_est = float(value.strip('%').replace(',', ''))
    else:  # value_element is None
        # print('Unable to locate Yahoo Finance Growth Estimate')
        return None
    # Convert the percentage to a fraction, e.g. 2.85 -> 0.029 (rounded to 3 places)
    return round(growth_est / 100.0, 3)
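One way to soften the request limit might be to retry with a growing delay between attempts. The sketch below is only an illustration: `fetch_with_backoff` and its parameters are hypothetical names, and whether Yahoo Finance starts answering again after a pause is an assumption, not something verified here. It treats a `None` result as a failed attempt, which matches what `yahoo_growth_soup` returns when the table cell is not found.

```python
import time

def fetch_with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None value, doubling the wait each time.

    `fetch` is any zero-argument callable.
    Returns the first non-None result, or None if every attempt fails.
    """
    delay = base_delay
    for attempt in range(retries):
        result = fetch()
        if result is not None:
            return result
        if attempt < retries - 1:
            sleep(delay)  # wait before the next attempt
            delay *= 2    # exponential backoff: 1s, 2s, 4s, ...
    return None

# Usage with the function above (not executed here, since it needs network access):
# growth = fetch_with_backoff(lambda: yahoo_growth_soup("BABA"))
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.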
Upvotes: 0
Reputation: 45443
One solution is to extract the value from the JSON data embedded in the page's JavaScript using a regex. The JSON data is assigned to the following variable:
root.App.main = { .... };
Example:
import requests
import re
import json
r = requests.get("https://finance.yahoo.com/quote/BABA/analysis?p=BABA")
data = json.loads(re.search(r'root\.App\.main\s*=\s*(.*);', r.text).group(1))
field = [t for t in data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]["earningsTrend"]["trend"] if t["period"] == "+5y" ][0]
print(field)
print("Next 5 Years (per annum) : " + field["growth"]["fmt"])
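The same idea can be tried offline against a small stand-in for the page text. This is a sketch under assumptions: `sample_page` is a made-up string that only mimics the structure used above, and `load_app_main` and `five_year_growth` are hypothetical helper names. Instead of a greedy regex up to the last `;`, it uses `json.JSONDecoder.raw_decode`, which parses one JSON value and ignores whatever JavaScript follows it.

```python
import json
import re

# Hypothetical stand-in for r.text; the real payload is far larger.
sample_page = (
    '<script>root.App.main = {"context": {"dispatcher": {"stores": '
    '{"QuoteSummaryStore": {"earningsTrend": {"trend": ['
    '{"period": "+1y", "growth": {"raw": 0.1, "fmt": "10.00%"}}, '
    '{"period": "+5y", "growth": {"raw": 0.0285, "fmt": "2.85%"}}'
    ']}}}}}};\n'
    '(function(){ var x = 1; })();</script>'
)

def load_app_main(page_text):
    """Decode the JSON object assigned to root.App.main, or return None."""
    match = re.search(r"root\.App\.main\s*=\s*", page_text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index
    # and returns it together with the index where parsing stopped.
    obj, _end = json.JSONDecoder().raw_decode(page_text, match.end())
    return obj

def five_year_growth(data):
    """Return the formatted "+5y" growth figure, or None if it is missing."""
    stores = data["context"]["dispatcher"]["stores"]
    trend = stores["QuoteSummaryStore"]["earningsTrend"]["trend"]
    field = next((t for t in trend if t.get("period") == "+5y"), None)
    return field["growth"]["fmt"] if field else None

data = load_app_main(sample_page)
print("Next 5 Years (per annum) : " + five_year_growth(data))
```

Returning `None` when the variable or the "+5y" entry is absent makes it easier to tell a changed page layout apart from a parsing bug.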
Upvotes: -1