Reputation: 33
I would like to scrape data (e.g., market capitalization, P/E ratio, etc.) from Google Finance using Python's BeautifulSoup library. However, when I try to extract certain elements (like "div", "tr", "td") from the HTML of the corresponding Google Finance page using the "find_all" function, I always receive an empty list (i.e., the "base" object in the code below is empty).
While debugging, I printed the "soup" object and compared its content with the HTML of the page as shown in my browser. What surprised me was that the content of the "soup" object differs from that HTML. I would expect the two to match in order to extract data successfully.
from bs4 import BeautifulSoup
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open('https://www.google.com/search?q=NASDAQ:GOOGL')
soup = BeautifulSoup(response, 'html.parser')
base = soup.find_all('div',{'class':'ZSM8k'})
print(soup)
print(base)
Upvotes: 2
Views: 4030
Reputation: 1724
As Imperishable Night said, it's most likely because you're not sending proper request headers. The "User-Agent" header makes a request look like a visit from a "real" user, so websites treat such requests as user requests. Check what your user-agent is.
For example, if you're using the "requests" library, the default user-agent is "python-requests", so websites can tell that it's a bot or a script sending the request, not a real user.
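You can see the default identifier without making any network call; "requests" exposes it via its (real) "default_user_agent" helper:

```python
import requests

# requests sends this User-Agent unless you override it with a headers dict;
# servers can use it to tell scripts apart from browsers
print(requests.utils.default_user_agent())  # e.g. "python-requests/2.28.1"
```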
Code and example:
from bs4 import BeautifulSoup
import requests, lxml
from itertools import zip_longest  # https://docs.python.org/3/library/itertools.html#itertools.zip_longest

def scrape_google_finance(ticker: str):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "hl": "en"  # language of the results
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    # https://www.whatismybrowser.com/detect/what-is-my-user-agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    }

    html = requests.get(f"https://www.google.com/finance/quote/{ticker}", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    right_panel_data = {"right_panel": {}}

    # each row of the right-hand panel holds a label (.mfs7Fc) and a value (.P6K39c)
    right_panel_keys = soup.select(".gyFHrc .mfs7Fc")
    right_panel_values = soup.select(".gyFHrc .P6K39c")

    for key, value in zip_longest(right_panel_keys, right_panel_values):
        key_value = key.text.lower().replace(" ", "_")
        right_panel_data["right_panel"][key_value] = value.text

    return right_panel_data

data = scrape_google_finance(ticker="GOOGL:NASDAQ")

print(data["right_panel"].keys())
print(data["right_panel"].get("ceo"))
# output:
"""
dict_keys(['previous_close', 'day_range', 'year_range', 'market_cap', 'volume', 'p/e_ratio', 'dividend_yield', 'primary_exchange', 'ceo', 'founded', 'headquarters', 'website', 'employees'])
Sundar Pichai
"""
If you want to scrape more data with a line-by-line explanation, there's a blog post of mine, Scrape Google Finance Ticker Quote Data in Python.
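As an aside, zip_longest (unlike plain zip) pads the shorter iterable instead of truncating, which is why the code above uses it. A minimal illustration with made-up key/value lists:

```python
from itertools import zip_longest

# hypothetical label/value lists of unequal length, as could happen
# if one panel row is missing its value on the page
keys = ["previous_close", "day_range", "year_range"]
values = ["$2,815.00", "$2,800.00 - $2,850.00"]

# zip() would silently drop "year_range"; zip_longest keeps it,
# pairing it with the fillvalue instead
for key, value in zip_longest(keys, values, fillvalue="N/A"):
    print(key, value)
```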
Upvotes: 0
Reputation: 1533
It is entirely up to the server what content it serves you, so the best you can do is to make your request look as much as possible like the request sent by the browser. In your case, this might mean:
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36')]
If I am not mistaken, this gives you what you want. You can try to remove irrelevant parts by trial-and-error if you want.
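Putting that line into the original snippet, a sketch of the full opener setup (the Accept-Language header here is an illustrative extra, not something the original answer requires):

```python
import urllib.request

# build an opener whose requests carry browser-like headers;
# which headers the server actually checks is unknown, so trim by trial and error
opener = urllib.request.build_opener()
opener.addheaders = [
    ("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/75.0.3770.90 Safari/537.36"),
    ("Accept-Language", "en-US,en;q=0.9"),  # illustrative extra header
]
# response = opener.open('https://www.google.com/search?q=NASDAQ:GOOGL')
```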
Upvotes: 1