Reputation: 39
I'm trying to scrape Google Finance, and get the "Related Stocks" table, which has id "cc-table" and class "gf-table" based on the webpage inspector in Chrome. (Sample Link: https://www.google.com/finance?q=tsla)
But when I run .find("table") or .findAll("table"), this table does not come up. I can find JSON-looking objects with the table's contents in the HTML content in Python, but I do not know how to extract them. Any ideas?
Upvotes: 1
Views: 8935
Reputation: 99
You can scrape Google Finance with the BeautifulSoup web scraping library alone, without Selenium, because the data you want to extract is not rendered via JavaScript. It will also be much faster than launching a whole browser.
from bs4 import BeautifulSoup
import json
import requests

# The "lxml" parser used below requires the lxml package to be installed.
params = {
    "hl": "en"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}
html = requests.get("https://www.google.com/finance?q=tsla", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

ticker_data = []
# Each .tOzDHb element is one ticker card on the page
for ticker in soup.select('.tOzDHb'):
    title = ticker.select_one('.RwFyvf').text
    price = ticker.select_one('.YMlKec').text
    index = ticker.select_one('.COaKTb').text
    price_change = ticker.select_one("[jsname=Fe7oBc]")["aria-label"]
    ticker_data.append({
        "index": index,
        "title": title,
        "price": price,
        "price_change": price_change
    })

print(json.dumps(ticker_data, indent=2))
Example output
[
{
"index": "Index",
"title": "Dow Jones Industrial Average",
"price": "32,774.41",
"price_change": "Down by 0.18%"
},
{
"index": "Index",
"title": "S&P 500",
"price": "4,122.47",
"price_change": "Down by 0.42%"
},
{
"index": "TSLA",
"title": "Tesla Inc",
"price": "$850.00",
"price_change": "Down by 2.44%"
},
# ...
]
If you need to scrape more data from Google Finance, there's a blog post titled "Scrape Google Finance Ticker Quote Data in Python".
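One caveat with the loop above: select_one returns None when a selector matches nothing, so .text raises AttributeError if one of these class names is missing or has changed. Below is a minimal, more defensive sketch. The helper names safe_text and extract_tickers are hypothetical, the selectors are just the ones used in this answer (they may go stale), and it uses get_text(strip=True) rather than .text, so surrounding whitespace is trimmed.

```python
def safe_text(node, selector, default=None):
    """Return the stripped text of the first match for selector, or default."""
    match = node.select_one(selector)
    return match.get_text(strip=True) if match is not None else default


def extract_tickers(soup):
    """Collect ticker rows from a parsed page, skipping incomplete cards.

    `soup` is expected to be a BeautifulSoup object (or any Tag).
    """
    tickers = []
    for card in soup.select(".tOzDHb"):
        change = card.select_one("[jsname=Fe7oBc]")
        row = {
            "index": safe_text(card, ".COaKTb"),
            "title": safe_text(card, ".RwFyvf"),
            "price": safe_text(card, ".YMlKec"),
            # .get() avoids a KeyError when the attribute is absent
            "price_change": change.get("aria-label") if change is not None else None,
        }
        # Keep only rows that have at least a title and a price
        if row["title"] and row["price"]:
            tickers.append(row)
    return tickers
```

To use it, build soup as in the snippet above and call extract_tickers(soup) in place of the for loop. The result can be passed to json.dumps as before.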
Upvotes: 2
Reputation: 9440
The page is rendered with JavaScript. There are several ways to render and scrape it.
I can scrape it with Selenium. First install Selenium:
sudo pip3 install selenium
Then download ChromeDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads
import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()
url = "https://www.google.com/finance?q=tsla"
browser.get(url)
html_source = browser.page_source
browser.quit()

soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
Alternatively, with PyQt5:
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
import bs4 as bs
import sys

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = "https://www.google.com/finance?q=tsla"
r = Render(url)
result = r.frame.toHtml()

soup = bs.BeautifulSoup(result, 'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
Alternatively, with Dryscrape:
import bs4 as bs
import dryscrape

url = "https://www.google.com/finance?q=tsla"
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()

soup = bs.BeautifulSoup(dsire_get, 'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
All three produce the same output:
Valuation▲▼Company name▲▼Price▲▼Change▲▼Chg %▲▼d | m | y▲▼Mkt Cap▲▼TSLATesla Inc328.40-1.52-0.46%53.69BDDAIFDaimler AG72.94-1.50-2.01%76.29BFFord Motor Company11.53-0.17-1.45%45.25BGMGeneral Motors Co...36.07-0.34-0.93%53.93BRNSDFRENAULT SA EUR3.8197.000.000.00%28.69BHMCHonda Motor Co Lt...27.52-0.18-0.65%49.47BAUDVFAUDI AG NPV840.400.000.00%36.14BTMToyota Motor Corp...109.31-0.53-0.48%177.79BBAMXFBAYER MOTOREN WER...94.57-2.41-2.48%56.93BNSANYNissan Motor Co L...20.400.000.00%42.85BMMTOFMITSUBISHI MOTOR ...6.86+0.091.26%10.22B
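Because get_text() concatenates every cell, the output above runs together. A possible refinement is to walk the rows and cells instead. This is a sketch that assumes the table uses ordinary tr/th/td markup, which the output suggests but which is not shown here. The helper name table_rows is hypothetical.

```python
def table_rows(table):
    """Return the table as a list of rows, each a list of cell strings.

    `table` is expected to be a BeautifulSoup Tag for a <table> element.
    """
    rows = []
    for tr in table.find_all("tr"):
        # Header and data cells are both collected, in document order
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells
            rows.append(cells)
    return rows
```

In any of the three snippets, print(el.get_text()) could then be replaced by a loop over table_rows(el) that prints " | ".join(row) for each row.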
EDIT
QtWebKit was deprecated upstream in Qt 5.5 and removed in 5.6.
You can switch to PyQt5.QtWebEngineWidgets instead.
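As a rough sketch of that switch, assuming the PyQtWebEngine package is installed alongside PyQt5: QWebEnginePage.toHtml() is asynchronous and hands the HTML to a callback, so the structure differs a little from the QWebPage version. The function name render_html is hypothetical. Note that loadFinished may fire before late-running scripts have filled in the page, so the table is not guaranteed to be present.

```python
import sys


def render_html(url):
    """Load url in a QWebEnginePage and return the rendered HTML as a string."""
    # PyQt5 is imported here so this module still loads where PyQt5 is absent.
    # QtWebEngineWidgets must be imported before the QApplication is created.
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    from PyQt5.QtWidgets import QApplication

    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebEnginePage()
    result = {}

    def store_html(html):
        # Called by Qt once the page source is available
        result["html"] = html
        app.quit()

    def on_load_finished(ok):
        # toHtml() returns nothing; it passes the HTML to the callback
        page.toHtml(store_html)

    page.loadFinished.connect(on_load_finished)
    page.load(QUrl(url))
    app.exec_()  # blocks until store_html() calls app.quit()
    return result.get("html", "")
```

The returned string can be passed to bs.BeautifulSoup(html, "lxml") and searched with find_all("table", {"id": "cc-table"}) exactly as in the snippets above.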
Upvotes: 8
Reputation: 1003
Most website owners don't like scrapers because they take data the company values, use up a whole bunch of their server time and bandwidth, and give nothing in return. Big companies like Google may have entire teams employing a whole host of methods to detect and block bots trying to scrape their data.
There are several ways around this:
Upvotes: 1