nut_flush

Reputation: 11

How do I scrape a specific table if the website has multiple tables?

I recently made a script to scrape some financial data off a website (https://www.cmegroup.com/trading/interest-rates/cleared-otc.html) so I could track changes in trading volumes for a project.

However, they seem to have slightly changed the HTML and my script does not work anymore.

I used to use this to get the values from 'table20'.

from selenium import webdriver
from bs4 import BeautifulSoup

# Options for Chrome Driver (Selenium)
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Program Files\Anaconda\chromedriver\chromedriver.exe')
driver.get("https://www.cmegroup.com/trading/interest-rates/cleared-otc.html")

# Grab the rendered HTML from the browser
current_page = driver.page_source

soup = BeautifulSoup(current_page, 'html.parser')
tbl = soup.find("div", {"id": "table20"})

However, tbl now comes back as None.

I've also tried the following but to no avail:

table_2 = soup.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == 'table20')
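
To check whether the table is present in the fetched source at all, a quick diagnostic (a small sketch, reusing the soup object from the script above) is to list every id that actually appears:

# List every element id present in the parsed page source,
# to see whether 'table20' (or a renamed variant) exists at all
for tag in soup.find_all(attrs={"id": True}):
    print(tag.name, tag["id"])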

So the question is: how do I scrape all those currency values from table20?

Upvotes: 0

Views: 82

Answers (1)

Well, I see no reason to use Selenium for a case like this, as it will slow down your task.

The website loads its data dynamically via a JavaScript event that renders it once the page loads.

The requests library cannot render JavaScript on the fly, so you could use selenium or requests_html instead (and there are plenty of other modules that can do this).
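
If you go that route, here is a minimal sketch with requests_html (untested against this page; the render timeout is an assumption, and Chromium is downloaded on first use):

from requests_html import HTMLSession

session = HTMLSession()
resp = session.get("https://www.cmegroup.com/trading/interest-rates/cleared-otc.html")
# render() executes the page's JavaScript in a headless Chromium,
# so dynamically inserted tables appear in the DOM
resp.html.render(timeout=20)

table = resp.html.find("#table20", first=True)
print(table.text if table else "table20 not found")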

Now, we do have another option on the table: track where the data is rendered from. I was able to locate the XHR request that the page uses to retrieve the data from the back-end API and render it on the user's side.

You can find that XHR request by opening Developer Tools, checking the Network tab, and filtering the XHR/JS requests made, depending on the type of call (such as fetch).

import requests
import pandas as pd


# The XHR endpoint returns the table as plain HTML after the XSLT transform
r = requests.get("https://www.cmegroup.com/CmeWS/mvc/xsltTransformer.do?xlstDoc=/XSLT/md/irs_settlement_TOTALS.xsl&url=/md/Clearing/IRS?date=03/20/2020&exchange=XCME")

# read_html parses every table in the response; [1] is the totals table,
# and [:-1] drops the trailing grand-total row
df = pd.read_html(r.text, header=0)[1][:-1]

# Keep the first five columns and save them
df.iloc[:, :5].to_csv("data.csv", index=False)
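
Since the trade date is embedded in the query string, you can parameterize the call to pull other sessions and track volumes over time (a sketch; which dates the endpoint accepts is an assumption):

import requests
import pandas as pd


def fetch_totals(date_str):
    # Build the same XHR URL for a given MM/DD/YYYY trade date
    url = (
        "https://www.cmegroup.com/CmeWS/mvc/xsltTransformer.do"
        "?xlstDoc=/XSLT/md/irs_settlement_TOTALS.xsl"
        f"&url=/md/Clearing/IRS?date={date_str}&exchange=XCME"
    )
    r = requests.get(url)
    # Same parsing as above: second table, minus the grand-total row
    return pd.read_html(r.text, header=0)[1][:-1]


df = fetch_totals("03/20/2020")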


Upvotes: 2
