Reputation: 129
Hi, I am trying to scrape an HTML table and I have working code.
The URL, however, contains two HTML tables. The first table contains "quarterly" numbers and loads by default with the URL. When you click the button above the table, you can switch to the second table with "annual" numbers.
My code only picks up the default (quarterly) table that appears when the URL loads.
How can I get my Python code to scrape the second "annual" table? Can Selenium do this? If so, could anyone provide some guidance?
#!/usr/local/bin/python3
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}
r = requests.get("https://www.investing.com/equities/exxon-mobil-income-statement", headers=headers)
df = pd.read_html(r.content)[1]
print(df)
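Before reaching for Selenium, it can help to check which tables are actually present in the static HTML that requests gets back; if the annual table is not among them, it is most likely only rendered client-side. A small diagnostic sketch, using the same request and headers as above:
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}
r = requests.get("https://www.investing.com/equities/exxon-mobil-income-statement", headers=headers)

# list every table pandas can parse out of the static HTML
tables = pd.read_html(r.content)
print(len(tables))                 # how many tables the static response contains
for i, t in enumerate(tables):
    print(i, t.shape)              # index and dimensions of each parsed table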
Many thanks
Upvotes: 1
Views: 643
Reputation: 20362
Try the following:
Sub Web_Table()
    'Requires references to "Microsoft Internet Controls" and "Microsoft HTML Object Library"
    Dim HTMLDoc As New HTMLDocument
    Dim objTable As Object
    Dim lngTable As Long
    Dim lngRow As Long
    Dim lngCol As Long
    Dim ActRw As Long
    Dim objIE As InternetExplorer

    Set objIE = New InternetExplorer
    objIE.Navigate "https://www.investing.com/equities/exxon-mobil-income-statement"
    Do Until objIE.ReadyState = 4 And Not objIE.Busy
        DoEvents
    Loop
    Application.Wait (Now + TimeValue("0:00:03")) 'wait for the JavaScript-rendered content to load

    'Copy the rendered HTML, then walk every table on the page cell by cell into Sheet1
    HTMLDoc.body.innerHTML = objIE.Document.body.innerHTML
    With HTMLDoc.body
        Set objTable = .getElementsByTagName("table")
        For lngTable = 0 To objTable.Length - 1
            For lngRow = 0 To objTable(lngTable).Rows.Length - 1
                For lngCol = 0 To objTable(lngTable).Rows(lngRow).Cells.Length - 1
                    ThisWorkbook.Sheets("Sheet1").Cells(ActRw + lngRow + 1, lngCol + 1) = objTable(lngTable).Rows(lngRow).Cells(lngCol).innerText
                Next lngCol
            Next lngRow
            ActRw = ActRw + objTable(lngTable).Rows.Length + 1 'leave a blank row between tables
        Next lngTable
    End With
    objIE.Quit
End Sub
Upvotes: 0
Reputation: 129
After much googling and some other Stack Overflow posts, I finally got this working:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

browser = webdriver.Firefox(executable_path=r'/Users/xxxxxx/Documents/python/web_drivers/geckodriver')
browser.get('https://www.investing.com/equities/exxon-mobil-income-statement')

# click the 'Annual' link to switch from the default quarterly table to the annual one
linkElem = browser.find_element_by_link_text('Annual')
linkElem.click()

# grab the rendered table's HTML, then hand it to pandas
r = browser.find_element_by_css_selector("#rrtable > table").get_attribute('outerHTML')
browser.quit()

soup = BeautifulSoup(r, 'html.parser')
df = pd.read_html(str(soup))[0]
print(df)
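Note that newer Selenium 4 releases drop the find_element_by_* helpers and the executable_path argument, so on a current install the same flow would look roughly like this (a sketch assuming geckodriver is on your PATH and the selectors above are unchanged):
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

browser = webdriver.Firefox()      # geckodriver assumed to be on PATH
browser.get('https://www.investing.com/equities/exxon-mobil-income-statement')

# switch to the annual view, then pull the table's HTML
browser.find_element(By.LINK_TEXT, 'Annual').click()
html = browser.find_element(By.CSS_SELECTOR, '#rrtable > table').get_attribute('outerHTML')
browser.quit()

df = pd.read_html(html)[0]
print(df)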
Upvotes: 1
Reputation: 536
Yes, you can do it with Selenium.
from selenium import webdriver

driver = webdriver.Firefox()   # assumes a geckodriver setup like the accepted answer; use whichever driver you have
driver.get("https://www.investing.com/equities/exxon-mobil-income-statement")
annual_button = driver.find_element_by_css_selector("#leftColumn > div.alignBottom > div.float_lang_base_1 > a:nth-child(1)")
annual_button.click()
print(driver.find_element_by_css_selector("#rrtable > table").get_attribute('innerHTML'))
Here's Python code for that.
What does it do? It enters the page, finds the annual_button element by its CSS selector and then clicks it. Then it finds the table by its CSS selector and prints its HTML.
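If you want the result as a DataFrame rather than raw HTML, the grabbed markup can go straight into pandas, much as in the accepted answer. A short follow-on sketch (note that outerHTML keeps the surrounding <table> tag that pd.read_html needs):
import pandas as pd

html = driver.find_element_by_css_selector("#rrtable > table").get_attribute('outerHTML')
driver.quit()
df = pd.read_html(html)[0]
print(df)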
Hope it helps.
Upvotes: 1