Siddharth Bhatheja

Reputation: 396

Scraping infinite scrolling yahoo finance historical data

I'm trying to scrape the past 5 years of Yahoo Finance historical data for a particular stock. I have written Python code that scrapes each row of the table containing the historical data. I know there are simpler ways to fetch historical data, but I want to do it by scraping. The problem is that Yahoo Finance uses infinite scrolling: as soon as I reach the end of the page, more rows are added dynamically to the table. My code fetches rows only up to the end of the first page, not the complete 5 years of data. Here is a sample of the code I'm trying:

After navigating to the table in the scraping part, I collect its rows:

tableRows = table.find_all('tr', class_='BdT Bdc($seperatorColor) Ta(end) Fz(s) Whs(nw)')

I then extract the values from these rows.
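As a rough sketch of that extraction step, the text of each row's cells can be converted into typed values. The column order (Date, Open, High, Low, Close, Adj Close, Volume), the date format, and the helper name `parse_row` are assumptions for illustration and are not taken from the code above; the sample row is made up.

```python
from datetime import datetime

# Assumed column order of the historical data table (not confirmed by the question).
COLUMNS = ["open", "high", "low", "close", "adj_close", "volume"]


def parse_row(cells, date_format="%d %b %Y"):
    """Convert the text of one table row's cells into a dict of typed values.

    Returns None for rows that do not have one date cell plus six numeric
    cells (for example, rows that only announce a dividend or a split).
    """
    if len(cells) != len(COLUMNS) + 1:
        return None
    record = {"date": datetime.strptime(cells[0].strip(), date_format).date()}
    for name, text in zip(COLUMNS, cells[1:]):
        text = text.strip().replace(",", "")  # drop thousands separators
        record[name] = float(text) if text not in ("", "-") else None
    return record


# Hypothetical usage with BeautifulSoup rows:
#   cells = [td.get_text() for td in row.find_all("td")]
sample = ["05 Mar 2021", "2,156.00", "2,199.00", "2,153.05",
          "2,178.70", "2,178.70", "7,510,421"]
print(parse_row(sample))
```

Returning `None` for short rows keeps dividend or split announcements out of the price data without raising an error partway through a long table.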

Upvotes: 0

Views: 1942

Answers (4)

bilke

Reputation: 415

Much better solutions have been shown, but here is how it can be done by pressing the "END" key:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

TBODY_XPATH = '//*[@id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table/tbody'

driver = webdriver.Chrome()
driver.implicitly_wait(6)

driver.get("https://uk.finance.yahoo.com/quote/RELIANCE.NS/history?period1=1297987200&period2=1613606400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")

# Dismiss the cookie consent page
driver.find_element(By.XPATH, '//*[@id="consent-page"]/div/div/div/form/div[2]/div[2]/button').click()

history_table = driver.find_element(By.XPATH, TBODY_XPATH).find_elements(By.TAG_NAME, "tr")
# Keep scrolling while the last loaded row is dated 2020 - 5 or later.
# The year is the third whitespace-separated token of the date cell.
while int(history_table[-1].find_elements(By.TAG_NAME, "td")[0].text.split()[2]) >= 2020 - 5:
    # Press END to jump to the bottom, which triggers loading of more rows
    ActionChains(driver).send_keys(Keys.END).perform()
    history_table = driver.find_element(By.XPATH, TBODY_XPATH).find_elements(By.TAG_NAME, "tr")

Upvotes: 0

chitown88

Reputation: 28565

Selenium is one way to do it. A more efficient way is to query the data directly:

import requests
import pandas as pd
import datetime

years = 5

dt= datetime.datetime.now()
past_date = datetime.datetime(year=dt.year-years, month=dt.month, day=dt.day)

url = 'https://query2.finance.yahoo.com/v8/finance/chart/RELIANCE.NS'
headers= {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
payload = {
'formatted': 'true',
'crumb': 'J2oUJNHQwXU',
'lang': 'en-GB',
'region': 'GB',
'includeAdjustedClose': 'true',
'interval': '1d',
'period1': '%s' %int(past_date.timestamp()),
'period2': '%s' %int(dt.timestamp()),
'events': 'div|split',
'useYfid': 'true',
'corsDomain': 'uk.finance.yahoo.com'}



jsonData = requests.get(url, headers=headers, params=payload).json()
result = jsonData['chart']['result'][0]

indicators = result['indicators']
rows = {'timestamp':result['timestamp']}
rows.update(indicators['adjclose'][0])
rows.update(indicators['quote'][0])

df = pd.DataFrame(rows)
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

Output:

print(df)
               timestamp     adjclose  ...         open          low
0    2016-03-08 03:45:00   492.139252  ...   499.019806   499.019806
1    2016-03-09 03:45:00   499.183502  ...   505.211090   504.517670
2    2016-03-10 03:45:00   484.831451  ...   516.132568   499.762756
3    2016-03-11 03:45:00   486.149292  ...   502.685059   500.555237
4    2016-03-14 03:45:00   488.665009  ...   504.765320   501.719208
                 ...          ...  ...          ...          ...
1229 2021-03-01 03:45:00  2101.699951  ...  2110.199951  2062.500000
1230 2021-03-02 03:45:00  2106.000000  ...  2122.000000  2089.100098
1231 2021-03-03 03:45:00  2202.100098  ...  2121.050049  2107.199951
1232 2021-03-04 03:45:00  2175.850098  ...  2180.000000  2157.699951
1233 2021-03-05 09:59:59  2178.699951  ...  2156.000000  2153.050049

[1234 rows x 7 columns]
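The `period1` and `period2` parameters are Unix timestamps in seconds. A minimal sketch of computing them follows; the helper name `period_bounds` is hypothetical. Unlike building the start date from `dt.month` and `dt.day` directly, it tolerates a run on 29 February, and it uses a timezone-aware date so the result does not depend on the local machine. The example reproduces the `period1`/`period2` values that appear in the history URL used in the Selenium answer.

```python
import datetime as dt


def period_bounds(end, years):
    """Return (period1, period2) as Unix timestamps covering `years` up to `end`.

    `end` should be timezone-aware so the result does not depend on the
    machine's local time zone.
    """
    try:
        start = end.replace(year=end.year - years)
    except ValueError:
        # `end` is 29 February and the target year is not a leap year.
        start = end.replace(year=end.year - years, day=28)
    return int(start.timestamp()), int(end.timestamp())


end = dt.datetime(2021, 2, 18, tzinfo=dt.timezone.utc)
print(period_bounds(end, 10))  # (1297987200, 1613606400)
```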

Upvotes: 1

pelelter

Reputation: 675

I suggest you try the yfinance library (https://pypi.org/project/yfinance/)

import yfinance as yf

msft = yf.Ticker("MSFT")

# get stock info
msft.info

# get historical market data
hist = msft.history(period="max")

Upvotes: 2

vmank

Reputation: 784

You need to imitate user behavior inside the browser in order to fetch the rest of the results.

  1. You can use the Selenium web driver
  2. Keep scrolling to the end of the page until no more results show up (this step requires JavaScript). During this step, make sure that you wait for each AJAX request to complete; otherwise you might end up with unexpected behavior.
  3. Once no more results show up, use the selector you are already using to retrieve the information
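The steps above can be sketched as a small loop that scrolls, waits, and stops once the number of rows no longer grows. This is only an illustration under assumptions: the helper name `scroll_until_stable`, the fixed pause used in place of a real AJAX wait, and the CSS selector in the comments are not from the answer. The browser is passed in as two callables so the loop itself stays independent of Selenium.

```python
import time


def scroll_until_stable(count_rows, scroll_once, pause=1.0, max_idle=3, max_rounds=500):
    """Scroll until the row count stops growing for `max_idle` checks in a row.

    count_rows  -- callable returning the number of rows currently loaded
    scroll_once -- callable that scrolls the page to the bottom once
    pause       -- seconds to wait after each scroll so new rows can load
    """
    seen = count_rows()
    idle = 0
    for _ in range(max_rounds):
        if idle >= max_idle:
            break
        scroll_once()
        time.sleep(pause)
        current = count_rows()
        if current > seen:
            seen, idle = current, 0
        else:
            idle += 1
    return seen


# Possible wiring with Selenium (the selector is an assumption):
#   from selenium.webdriver.common.by import By
#   count_rows = lambda: len(driver.find_elements(By.CSS_SELECTOR, "table tbody tr"))
#   scroll_once = lambda: driver.execute_script(
#       "window.scrollTo(0, document.body.scrollHeight);")
#   total = scroll_until_stable(count_rows, scroll_once)
```

Requiring several idle checks before stopping gives a slow request a chance to finish, so the loop is less likely to stop early than one that quits on the first unchanged count.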

Upvotes: 1
