Reputation: 9
So I am learning web scraping and practicing on the Yahoo Finance website, but iterating through the next pages of the table I'm extracting is a hassle.
I tried the code below, but it only ever scraped the first page and never navigated to the other pages:
import requests
from bs4 import BeautifulSoup

data = []
for page in range(0, 201, 25):
    url = f'https://finance.yahoo.com/markets/stocks/most-active/?start={page}&count=25'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    columns = soup.find_all('div', class_='header-container yf-1dbt8wv')
    header = [name.text.strip() for name in columns]
    header.insert(1, "Name")

    body = soup.find('tbody')
    rows = body.find_all('tr', class_='yf-1dbt8wv')
    for row in rows:
        point = row.find_all('td', class_='cell yf-1dbt8wv')
        line = [case.text.strip() for case in point]
        # the first cell holds "SYMBOL Name", so split it into two fields
        splitter = line[0].split(" ", 1)
        line = splitter + line[1:]
        line[1] = line[1].strip()
        line[2] = line[2].split(" ", 1)[0]
        data.append(line)
Furthermore, since the URL is dynamic, I tried the URL that presents all 203 rows of the table on one page:
import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/markets/stocks/most-active/?start=0&count=203'
# time.sleep(5)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

columns = soup.find_all('div', class_='header-container yf-1dbt8wv')
header = [name.text.strip() for name in columns]
header.insert(1, "Name")

data = []
body = soup.find('tbody')
rows = body.find_all('tr', class_='yf-1dbt8wv')
for row in rows:
    point = row.find_all('td', class_='cell yf-1dbt8wv')
    line = [case.text.strip() for case in point]
    splitter = line[0].split(" ", 1)
    line = splitter + line[1:]
    line[1] = line[1].strip()
    line[2] = line[2].split(" ", 1)[0]
    data.append(line)
... and even though I can literally see all the rows of the table on one page in the browser, the script still scraped only the default 25 rows.
Am I missing something? Is there something else I need to learn to get this right? I'd appreciate some assistance. Thank you!
Upvotes: 0
Views: 77
Reputation: 635
Why not use their API to fetch all this info with a single request?
Note: Yahoo Finance is currently in a loading state and I'm seeing only 60 rows. Here is the code, using that API endpoint:
import requests

header = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0"
}
url = 'https://query1.finance.yahoo.com/v1/finance/screener/predefined/saved?count=228&formatted=true&scrIds=MOST_ACTIVES&sortField=&sortType=&start=0&fields=ticker%2Csymbol%2ClongName%2CshortName%2CregularMarketPrice%2CregularMarketChange%2CregularMarketChangePercent%2CregularMarketVolume%2CaverageDailyVolume3Month%2CmarketCap%2CtrailingPE%2CfiftyTwoWeekChangePercent%2CfiftyTwoWeekRange%2CregularMarketOpen%2ClongName%2Csparkline&lang=en-US&region=US'
response = requests.get(url, headers=header)
for i in response.json()['finance']['result'][0]['quotes']:
    try:
        longName = i['longName']
    except KeyError:
        longName = i['quoteSourceName']
    trailingPE = None
    try:
        trailingPE = i['trailingPE']['fmt']
    except KeyError:
        pass
    data = [i['symbol'], longName, i['regularMarketChangePercent']['fmt'], i['regularMarketPrice']['fmt'], i['regularMarketChange']['fmt'], i['regularMarketChangePercent']['fmt'], i['regularMarketVolume']['fmt'], i['averageDailyVolume3Month']['fmt'], i['marketCap']['fmt'], trailingPE, i['fiftyTwoWeekChangePercent']['fmt'], i['fiftyTwoWeekRange']['fmt']]
    print(data)
['NVDA', 'NVIDIA Corporation', '-0.55%', '140.76', '-0.78', '-0.55%', '86.331M', '319.251M', '3.453T', '65.78', '243.87%', '39.23 - 144.42']
['NIO', 'NIO Inc.', '12.26%', '5.91', '0.64', '12.26%', '65.941M', '63.518M', '12.44B', None, '-30.05%', '3.61 - 9.57']
['DJT', 'Trump Media & Technology Group Corp.', '20.46%', '46.92', '7.97', '20.46%', '64.634M', '18.256M', '9.391B', None, '157.27%', '11.75 - 79.38']
['TSLA', 'Tesla, Inc.', '0.61%', '270.85', '1.65', '0.61%', '53.774M', '80.132M', '869.429B', '74.00', '36.40%', '138.80 - 273.54']
['LCID', 'Lucid Group, Inc.', '1.40%', '2.5350', '0.0350', '1.40%', '29.93M', '37.243M', '6.603B', None, '-38.57%', '2.29 - 5.31']
['MARA', 'MARA Holdings, Inc.', '6.69%', '18.20', '1.14', '6.69%', '29.74M', '35.752M', '5.677B', '20.22', '92.77%', '8.39 - 34.09']
['OKLO', 'Oklo Inc.', '25.27%', '23.94', '4.83', '25.27%', '27.071M', '8.341M', '2.923B', None, '85.62%', '5.35 - 24.63']
['F', 'Ford Motor Company', '2.12%', '11.31', '0.24', '2.12%', '25.541M', '51.988M', '44.9B', '11.78', '13.31%', '9.49 - 14.85']
['SOFI', 'SoFi Technologies, Inc.', '2.32%', '11.24', '0.26', '2.32%', '23.789M', '44.675M', '11.986B', None, '58.36%', '6.01 - 11.34']
['CLSK', 'CleanSpark, Inc.', '5.93%', '12.07', '0.68', '5.93%', '21.688M', '23.413M', '3.118B', None, '178.48%', '3.46 - 24.72']
['AAL', 'American Airlines Group Inc.', '4.93%', '13.80', '0.65', '4.93%', '21.568M', '35.504M', '9.067B', '32.85', '17.62%', '9.07 - 16.15']
['IBRX', 'ImmunityBio, Inc.', '13.91%', '6.06', '0.74', '13.91%', '21.13M', '3.938M', '4.221B', None, '72.73%', '2.56 - 10.53']
['WULF', 'TeraWulf Inc.', '11.29%', '7.10', '0.72', '11.29%', '21.098M', '21.733M', '2.725B', None, '469.64%', '0.89 - 7.24']
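As a usage note: if the goal is a file rather than printed lists, the same extraction can append to a list and be written out with the standard csv module. A minimal sketch, where sample_quotes is my own stand-in for response.json()['finance']['result'][0]['quotes'] (shape assumed from the loop above) and the header row is my own choice:

```python
import csv

# stand-in for response.json()['finance']['result'][0]['quotes'] (assumed shape)
sample_quotes = [
    {"symbol": "NVDA", "longName": "NVIDIA Corporation",
     "regularMarketPrice": {"fmt": "140.76"}, "regularMarketVolume": {"fmt": "86.331M"}},
    {"symbol": "NIO",  # no longName -> fall back to quoteSourceName
     "quoteSourceName": "NIO Inc.",
     "regularMarketPrice": {"fmt": "5.91"}, "regularMarketVolume": {"fmt": "65.941M"}},
]

rows = []
for q in sample_quotes:
    name = q.get("longName", q.get("quoteSourceName"))
    rows.append([q["symbol"], name,
                 q["regularMarketPrice"]["fmt"], q["regularMarketVolume"]["fmt"]])

with open("most_active.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Symbol", "Name", "Price", "Volume"])  # header is my own choice
    writer.writerows(rows)
```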
Let me know if this works for you!
Upvotes: 0
Reputation: 27316
The Yahoo Finance pages are quite complex.
There may be a prompt for cookie accept/reject. You need to deal with that first of all.
Subsequently, you need to realise that the pages are driven by JavaScript and are unlikely to produce expected results using a combination of requests and BeautifulSoup. You should probably be using selenium.
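To see why, note that the rows never appear in the HTML that requests receives; they are injected client-side after the page loads. A toy illustration using only the standard library (the static snippet is my own stand-in for a server response, not Yahoo's actual markup):

```python
from html.parser import HTMLParser

# toy server response: the table body ships empty and JavaScript fills in the
# <tr> elements later -- an assumed simplification of how Yahoo's page behaves
SERVER_HTML = """
<table><tbody id="rows"></tbody></table>
<script>/* JS would inject <tr> elements here after load */</script>
"""

class TrCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.count += 1

p = TrCounter()
p.feed(SERVER_HTML)
print(p.count)  # 0 -> requests/BeautifulSoup never see the rows at all
```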
The way to page forward is to look for a particular button and, if it's not disabled, emulate a click; then refresh the driver and carry on.
Here's an example of how you could get all company names (which can be found in a span element with the longName class). You should be able to easily extend this to get the specific data that you want.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.action_chains import ActionChains

options = ChromeOptions()
options.add_argument("--headless=true")

url = "https://finance.yahoo.com/markets/stocks/most-active/"

def click(driver, e):
    action = ActionChains(driver)
    action.click(e)
    action.perform()

def reject(driver, wait):
    # dismiss the cookie consent prompt if it appears
    try:
        selector = By.CSS_SELECTOR, "button.reject-all"
        button = wait.until(EC.presence_of_element_located(selector))
        click(driver, button)
    except Exception:
        pass

def text(e):
    # fall back to textContent for elements Selenium reports as empty
    if r := e.text:
        return r
    return e.get_attribute("textContent")

def next_page(driver, wait):
    # the third button in the paging controls is "next"; click it unless disabled
    selector = By.CSS_SELECTOR, "div.buttons button"
    buttons = wait.until(EC.presence_of_all_elements_located(selector))
    if not buttons[2].get_attribute("disabled"):
        click(driver, buttons[2])
        driver.refresh()
        return True
    return False

with webdriver.Chrome(options) as driver:
    driver.get(url)
    wait = WebDriverWait(driver, 5)
    reject(driver, wait)
    selector = By.CSS_SELECTOR, "tbody.body tr td.cell span.longName"
    while True:
        for span in wait.until(EC.presence_of_all_elements_located(selector)):
            print(text(span))
        if not next_page(driver, wait):
            break
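The "extend this" step can be kept testable by moving the per-row parsing into a plain helper that is independent of Selenium. This sketch reuses the split logic from the question's own code and assumes (as that code does) that the first cell's text is "SYMBOL Name" and the second starts with the price:

```python
def parse_row(cells):
    # cells: stripped <td> texts for one <tr>; inside the Selenium loop they could
    # come from [td.text.strip() for td in row.find_elements(By.CSS_SELECTOR, "td")]
    # (hypothetical wiring -- this helper itself needs no browser)
    symbol, name = cells[0].split(" ", 1)  # first cell: "SYMBOL Name"
    price = cells[1].split(" ", 1)[0]      # second cell: price, then trailing text
    return [symbol, name.strip(), price] + cells[2:]

print(parse_row(["NVDA NVIDIA Corporation", "140.76 -0.78 (-0.55%)", "86.331M"]))
```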
Upvotes: 0