Mathlearner

Reputation: 117

extracting data from an HTML table using BeautifulSoup

UPDATE:

Following Pygirl's suggestion I am now trying Selenium, but I'm still only getting the Sector data:

import requests
import csv
import pandas as pd
from requests import get
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.set_window_size(1024, 600)
driver.maximize_window()
driver.get('https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/si_performance.jhtml?tab=siperformance')
action = ActionChains(driver)
sleep(4)
industry_link = driver.find_element_by_css_selector('#tab_industry')
action.move_to_element(industry_link)
action.click(industry_link)
action.perform()

url = driver.current_url
r = requests.get(url)

sleep(10)

df_industry_list = pd.read_html(r.text)
df_industry = df_industry_list[0]
df_industry.head()
df_industry.to_excel("SectorPerf.xlsx", sheet_name = "Industry")

I'm trying to get the data from the Industry link of this url: https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/si_performance.jhtml?tab=siperformance

I have written some code that gets the Sector data. However, the same approach doesn't work for the Industry tab, because the URL appears to be the same for both the Sector and the Industry tab:

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
from requests import get

url = 'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/si_performance.jhtml?tab=siperformance'
r = requests.get(url)
#soup = BeautifulSoup(response.content, 'html.parser')

#sectors = soup.find("table", id="perfTableSort")
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()
#print(df)

Given that the URL seems to be the same (at least, it shows the same in my address bar in Chrome), how can I also get the Industry data?

Thanks

Upvotes: 0

Views: 82

Answers (2)

Pygirl

Reputation: 13349

Use driver.page_source. After you click the Industry tab it holds the Industry page, so you can extract the table from it and save it as CSV or Excel:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.set_window_size(1024, 600)
driver.maximize_window()
driver.get('https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/si_performance.jhtml?tab=siperformance')
# action = webdriver.ActionChains(driver)
print(driver.page_source) # <--- this will give you source code for Sector
sleep(2)
industry_link = driver.find_element_by_xpath('//*[@id="tab_industry"]')
# action.move_to_element(industry_link)
industry_link.click()
# action.perform()
print(driver.page_source) # <--- this will give you source code for Industry
sleep(2)
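
A minimal sketch of the "extract the table and store it" step, using only the standard library. It assumes the Industry table carries the id perfTableSort (the id that appears in the question's commented-out code for the Sector table; I have not confirmed the Industry tab uses the same one) and that the markup has explicit closing tags, as a browser-serialized page_source does. TableExtractor and table_to_csv are names made up for this example, and the sample HTML is invented data, not Fidelity's.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect cell text from any <table> whose id matches table_id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.rows = []      # finished rows, each a list of cell strings
        self._depth = 0     # > 0 while inside the target table
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._depth:
                self._depth += 1            # nested table inside the target
            elif dict(attrs).get("id") == self.table_id:
                self._depth = 1             # entering the target table
        elif self._depth == 1 and tag == "tr":
            self._row = []
        elif self._depth == 1 and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if not self._depth:
            return
        if tag == "table":
            self._depth -= 1
        elif self._depth == 1 and tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif self._depth == 1 and tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_csv(page_source, table_id="perfTableSort"):
    """Return the matching table as CSV text (empty string if not found)."""
    parser = TableExtractor(table_id)
    parser.feed(page_source)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Invented sample standing in for driver.page_source
sample = """
<table id="other"><tr><td>ignore</td></tr></table>
<table id="perfTableSort">
  <tr><th>Industry</th><th>1 Day</th></tr>
  <tr><td>Example A</td><td>+0.10%</td></tr>
</table>
"""
print(table_to_csv(sample))
```

With Selenium you would call table_to_csv(driver.page_source) after the click and write the returned text to a .csv file. If pandas is available, pd.read_html on the same page source is shorter; this version just avoids extra parser dependencies.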

Upvotes: 1

pritam samanta

Reputation: 445

Try posting the tab name as form data:

import requests
import pandas as pd

url = 'https://eresearch.fidelity.com/eresearch/markets_sectors/si_performance.jhtml'

# Form field that selects which tab's table the server returns
industry = {'tab': 'industry'}
sector = {'tab': 'sector'}

r = requests.post(url, data=industry)

# read_html returns one DataFrame per <table> found in the page
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()

Pass data=industry or data=sector to get the table you want.
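
For reference, data= makes requests send the dict as a form-encoded POST body rather than as a query string. A rough standard-library sketch of the equivalent request, built but not sent, is below. build_tab_request is a name made up for this example, and whether the server honors the tab field is taken from the code above, not something this sketch checks.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://eresearch.fidelity.com/eresearch/markets_sectors/si_performance.jhtml"


def build_tab_request(tab):
    """Build (but do not send) a form-encoded POST that selects the given tab."""
    body = urlencode({"tab": tab}).encode("ascii")
    return Request(
        URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_tab_request("industry")
print(req.get_method(), req.data)
```

The URL in the address bar stays the same for both tabs because the tab choice travels in the request body, not in the URL.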

Upvotes: 2
