chik0di

Reputation: 9

Scraping the next pages of the same table with Python and BeautifulSoup

So I am learning web scraping and practicing on the Yahoo Finance website, but it's a hassle iterating through the next pages of the table I'm extracting.

I tried the code below, but it only ever returned the first page and never navigated to the other pages.

import requests
from bs4 import BeautifulSoup

for page in range(0, 201, 25):
    url = f'https://finance.yahoo.com/markets/stocks/most-active/?start=-{page}&count=25'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html')

    columns = soup.find_all('div', class_='header-container yf-1dbt8wv')
    header = [name.text.strip() for name in columns]
    header.insert(1, "Name")

    data = []
    body = soup.find('tbody')
    rows = body.find_all('tr', class_='yf-1dbt8wv')

    for row in rows:
        point = row.find_all('td', class_='cell yf-1dbt8wv') 
        line = [case.text.strip() for case in point] 
        splitter = line[0].split(" ", 1)
        line = splitter + line[1:]
        line[1] = line[1].strip()
        line[2] = line[2].split(" ", 1)[0]
        data.append(line)

Furthermore, since the URL is dynamic, I also tried the URL that presents all 203 rows of the table on a single page:

url = 'https://finance.yahoo.com/markets/stocks/most-active/?start=0&count=203'
# time.sleep(5)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

columns = soup.find_all('div', class_='header-container yf-1dbt8wv')
header = [name.text.strip() for name in columns]
header.insert(1, "Name")

data = []
body = soup.find('tbody')
rows = body.find_all('tr', class_='yf-1dbt8wv')

for row in rows:
    point = row.find_all('td', class_='cell yf-1dbt8wv') 
    line = [case.text.strip() for case in point] 
    splitter = line[0].split(" ", 1)
    line = splitter + line[1:]
    line[1] = line[1].strip()
    line[2] = line[2].split(" ", 1)[0]
    data.append(line)

... and even though I can literally see all the rows of the table on one page, it still scraped only the default 25 rows.
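
For what it's worth, a quick row count on the raw response (a rough check, with the same markup assumptions as above) confirms that only 25 rows come back even with count=203 in the URL:

import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/markets/stocks/most-active/?start=0&count=203'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
body = soup.find('tbody')
print(len(body.find_all('tr')) if body else 0)  # prints 25 for me, not 203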

Am I missing something? Is there something else I need to learn to get this right? I'd appreciate some assistance. Thank you!

Upvotes: 0

Views: 77

Answers (2)

x1337Loser

Reputation: 635

Why not use their API to fetch all this info with a single request?

Note: Yahoo Finance is still loading data at the moment and I'm only seeing 60 rows, but here is the code with the API endpoint.

Sample code:

import requests

# a browser-like User-Agent header (Yahoo tends to reject requests without one)
header = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0"
}
url = 'https://query1.finance.yahoo.com/v1/finance/screener/predefined/saved?count=228&formatted=true&scrIds=MOST_ACTIVES&sortField=&sortType=&start=0&fields=ticker%2Csymbol%2ClongName%2CshortName%2CregularMarketPrice%2CregularMarketChange%2CregularMarketChangePercent%2CregularMarketVolume%2CaverageDailyVolume3Month%2CmarketCap%2CtrailingPE%2CfiftyTwoWeekChangePercent%2CfiftyTwoWeekRange%2CregularMarketOpen%2ClongName%2Csparkline&lang=en-US&region=US'
response = requests.get(url, headers=header)

# each quote is a dict of fields; formatted values sit under the 'fmt' key
for i in response.json()['finance']['result'][0]['quotes']:
    try:
        longName = i['longName']
    except KeyError:
        # some quotes have no longName, fall back to the source name
        longName = i['quoteSourceName']
    trailingPE = None
    try:
        trailingPE = i['trailingPE']['fmt']
    except KeyError:
        pass  # not every quote reports a trailing P/E
    data = [i['symbol'], longName, i['regularMarketChangePercent']['fmt'],
            i['regularMarketPrice']['fmt'], i['regularMarketChange']['fmt'],
            i['regularMarketChangePercent']['fmt'], i['regularMarketVolume']['fmt'],
            i['averageDailyVolume3Month']['fmt'], i['marketCap']['fmt'], trailingPE,
            i['fiftyTwoWeekChangePercent']['fmt'], i['fiftyTwoWeekRange']['fmt']]
    print(data)

Sample Output:

['NVDA', 'NVIDIA Corporation', '-0.55%', '140.76', '-0.78', '-0.55%', '86.331M', '319.251M', '3.453T', '65.78', '243.87%', '39.23 - 144.42']
['NIO', 'NIO Inc.', '12.26%', '5.91', '0.64', '12.26%', '65.941M', '63.518M', '12.44B', None, '-30.05%', '3.61 - 9.57']
['DJT', 'Trump Media & Technology Group Corp.', '20.46%', '46.92', '7.97', '20.46%', '64.634M', '18.256M', '9.391B', None, '157.27%', '11.75 - 79.38']
['TSLA', 'Tesla, Inc.', '0.61%', '270.85', '1.65', '0.61%', '53.774M', '80.132M', '869.429B', '74.00', '36.40%', '138.80 - 273.54']
['LCID', 'Lucid Group, Inc.', '1.40%', '2.5350', '0.0350', '1.40%', '29.93M', '37.243M', '6.603B', None, '-38.57%', '2.29 - 5.31']
['MARA', 'MARA Holdings, Inc.', '6.69%', '18.20', '1.14', '6.69%', '29.74M', '35.752M', '5.677B', '20.22', '92.77%', '8.39 - 34.09']
['OKLO', 'Oklo Inc.', '25.27%', '23.94', '4.83', '25.27%', '27.071M', '8.341M', '2.923B', None, '85.62%', '5.35 - 24.63']
['F', 'Ford Motor Company', '2.12%', '11.31', '0.24', '2.12%', '25.541M', '51.988M', '44.9B', '11.78', '13.31%', '9.49 - 14.85']
['SOFI', 'SoFi Technologies, Inc.', '2.32%', '11.24', '0.26', '2.32%', '23.789M', '44.675M', '11.986B', None, '58.36%', '6.01 - 11.34']
['CLSK', 'CleanSpark, Inc.', '5.93%', '12.07', '0.68', '5.93%', '21.688M', '23.413M', '3.118B', None, '178.48%', '3.46 - 24.72']
['AAL', 'American Airlines Group Inc.', '4.93%', '13.80', '0.65', '4.93%', '21.568M', '35.504M', '9.067B', '32.85', '17.62%', '9.07 - 16.15']
['IBRX', 'ImmunityBio, Inc.', '13.91%', '6.06', '0.74', '13.91%', '21.13M', '3.938M', '4.221B', None, '72.73%', '2.56 - 10.53']
['WULF', 'TeraWulf Inc.', '11.29%', '7.10', '0.72', '11.29%', '21.098M', '21.733M', '2.725B', None, '469.64%', '0.89 - 7.24']
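
If you'd rather have everything in one table, here's a rough sketch that continues from the code above (it assumes pandas is installed; the column labels are just names I picked, not anything Yahoo returns):

import pandas as pd  # assumption: pandas is available

rows = []
for i in response.json()['finance']['result'][0]['quotes']:
    # same field handling as above, just collected instead of printed
    try:
        longName = i['longName']
    except KeyError:
        longName = i['quoteSourceName']
    trailingPE = i.get('trailingPE', {}).get('fmt')
    rows.append([i['symbol'], longName, i['regularMarketPrice']['fmt'],
                 i['regularMarketChange']['fmt'], i['regularMarketChangePercent']['fmt'],
                 i['regularMarketVolume']['fmt'], i['marketCap']['fmt'], trailingPE])

# column labels chosen here for readability only
df = pd.DataFrame(rows, columns=['Symbol', 'Name', 'Price', 'Change', 'Change %',
                                 'Volume', 'Market Cap', 'P/E (TTM)'])
df.to_csv('most_active.csv', index=False)
print(df.head())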

Let me know if this works for you!

Upvotes: 0

Adon Bilivit

Reputation: 27316

The Yahoo Finance pages are quite complex.

There may be a prompt for cookie accept/reject. You need to deal with that first of all.

Subsequently, you need to realise that the pages are driven by JavaScript and are unlikely to produce the expected results with a combination of requests and BeautifulSoup. You should probably be using Selenium.

The way to page forward is to look for a particular button and if it's not disabled, emulate a click. Refresh the driver and carry on.

Here's an example of how you could get all company names (which can be found in a span element with the longName class). You should be able to easily extend this to get the specific data that you want.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.action_chains import ActionChains

options = ChromeOptions()
# run Chrome headless (no visible browser window)
options.add_argument("--headless=true")

url = "https://finance.yahoo.com/markets/stocks/most-active/"

# click an element via ActionChains
def click(driver, e):
    action = ActionChains(driver)
    action.click(e)
    action.perform()

# dismiss the cookie accept/reject prompt if it appears
def reject(driver, wait):
    try:
        selector = By.CSS_SELECTOR, "button.reject-all"
        button = wait.until(EC.presence_of_element_located(selector))
        click(driver, button)
    except Exception:
        pass  # no consent prompt, carry on

# element.text only returns visible text, so fall back to textContent
def text(e):
    if r := e.text:
        return r
    return e.get_attribute("textContent")

# click the "next page" button (the third button in the pager) unless it is
# disabled; return False once the last page has been reached
def next_page(driver, wait):
    selector = By.CSS_SELECTOR, "div.buttons button"
    buttons = wait.until(EC.presence_of_all_elements_located(selector))
    if not buttons[2].get_attribute("disabled"):
        click(driver, buttons[2])
        driver.refresh()
        return True
    return False

with webdriver.Chrome(options) as driver:
    driver.get(url)
    wait = WebDriverWait(driver, 5)
    reject(driver, wait)
    # the company name sits in a span with the longName class inside each cell
    selector = By.CSS_SELECTOR, "tbody.body tr td.cell span.longName"
    while True:
        for span in wait.until(EC.presence_of_all_elements_located(selector)):
            print(text(span))
        if not next_page(driver, wait):
            break
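
For example, to pull every cell of every row rather than just the company name, you could replace the final while loop inside the with block with something along these lines (a sketch; the plain td selector is an assumption based on the markup above):

    # sketch: select whole rows, then read each cell's text
    selector = By.CSS_SELECTOR, "tbody.body tr"
    while True:
        for row in wait.until(EC.presence_of_all_elements_located(selector)):
            cells = row.find_elements(By.CSS_SELECTOR, "td")
            print([text(cell) for cell in cells])
        if not next_page(driver, wait):
            break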

Upvotes: 0
