Tobson

Reputation: 33

Web scraping from Google Finance: returned data list always empty

I would like to scrape data (e.g., market capitalization, P/E ratio, etc.) from Google Finance using Python's BeautifulSoup library. However, when I try to extract certain elements (such as "div", "tr", or "td" tags) from the HTML of the corresponding Google Finance page using the "find_all" function, I always receive an empty list (i.e., the "base" object in the code below is empty).

During debugging, I printed the "soup" object and compared its content with the HTML I see in the browser. What surprised me was that the content of the "soup" object differs from that HTML. I would expect both to match in order to extract data successfully.

from bs4 import BeautifulSoup
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open('https://www.google.com/search?q=NASDAQ:GOOGL')

soup = BeautifulSoup(response, 'html.parser')
base = soup.find_all('div',{'class':'ZSM8k'})

print(soup)
print(base)
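To see the symptom in isolation: "find_all" simply returns an empty list when the class is absent from the HTML that was actually served. A minimal offline sketch (the ZSM8k class name is from the question; the sample HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for what the server actually returned,
# which does NOT contain the class the browser-rendered page shows.
served_html = "<html><body><div class='other'>Market cap: 1.8T</div></body></html>"
soup = BeautifulSoup(served_html, "html.parser")

# Searching for a class that is not in the served HTML yields [].
print(soup.find_all("div", {"class": "ZSM8k"}))  # []

# Searching for a class that IS present finds the element.
print(soup.find_all("div", {"class": "other"}))
```

So an empty result list does not mean the parser failed; it means the HTML you received is not the HTML you inspected in the browser.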

Upvotes: 2

Views: 4030

Answers (2)

Dmitriy Zub

Reputation: 1724

As Imperishable Night said, it's most likely because you're not sending proper request headers. The User-Agent header makes a request look like a visit from a "real" user, so websites treat such requests as user traffic. Check what your user-agent is.

For example, if you're using the requests library, the default user-agent is python-requests, so websites can tell that the request comes from a bot or a script rather than a real user.
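The same applies to the urllib opener from the question: it announces itself as Python-urllib by default. A quick stdlib-only check (no network request needed):

```python
import urllib.request

# build_opener() pre-populates addheaders with urllib's default
# user-agent, which looks like "Python-urllib/3.x" - easy for a
# server to flag as a script rather than a browser.
opener = urllib.request.build_opener()
print(opener.addheaders)
```

This is why replacing the default header with a realistic browser user-agent string changes what the server serves you.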

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml  # lxml is required for the "lxml" parser backend below
from itertools import zip_longest # https://docs.python.org/3/library/itertools.html#itertools.zip_longest


def scrape_google_finance(ticker: str):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "hl": "en"
        }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    # https://www.whatismybrowser.com/detect/what-is-my-user-agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
        }

    html = requests.get(f"https://www.google.com/finance/quote/{ticker}", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")
    
    right_panel_data = {"right_panel": {}}
    
    right_panel_keys = soup.select(".gyFHrc .mfs7Fc")
    right_panel_values = soup.select(".gyFHrc .P6K39c")
    
    for key, value in zip_longest(right_panel_keys, right_panel_values):
        if key is None or value is None:
            continue  # zip_longest pads mismatched lists with None

        key_value = key.text.lower().replace(" ", "_")
        right_panel_data["right_panel"][key_value] = value.text
    
    return right_panel_data
    

data = scrape_google_finance(ticker="GOOGL:NASDAQ") 

print(data["right_panel"].keys())

print(data["right_panel"].get("ceo"))

# output:
"""
dict_keys(['previous_close', 'day_range', 'year_range', 'market_cap', 'volume', 'p/e_ratio', 'dividend_yield', 'primary_exchange', 'ceo', 'founded', 'headquarters', 'website', 'employees'])
Sundar Pichai
"""
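A side note on the zip_longest call above (an illustrative aside, not part of the original answer): if the page ever serves mismatched numbers of keys and values, zip_longest pads the shorter list rather than silently dropping entries, which is why the loop guards against that case. With plain strings the padding behavior looks like this:

```python
from itertools import zip_longest

keys = ["market_cap", "p/e_ratio", "ceo"]
values = ["1.80T", "25.3"]  # one value missing

# zip_longest pads the shorter iterable (with None by default, or
# with an explicit fillvalue), so no key is lost.
pairs = dict(zip_longest(keys, values, fillvalue="N/A"))
print(pairs)  # {'market_cap': '1.80T', 'p/e_ratio': '25.3', 'ceo': 'N/A'}
```

The built-in zip() would instead stop at the shorter list and drop "ceo" entirely.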

If you want to scrape more data with a line-by-line explanation, have a look at my blog post, Scrape Google Finance Ticker Quote Data in Python.

Upvotes: 0

Imperishable Night

Reputation: 1533

It is entirely up to the server what content it serves you, so the best you can do is make your request look as much like the request sent by a browser as possible. In your case, this might mean:

opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36')]

If I am not mistaken, this gives you what you want. You can try to remove irrelevant parts by trial-and-error if you want.

Upvotes: 1
