Reputation: 2923
I have read that to render javascript to scrape the raw html, I will need to use selenium and a webdriver like phantomjs. However, doing so still does not render the javascripts for me. Below is a sample script.
Anyone?
from selenium import webdriver
import time
url="http://www.sgx.com/wps/portal/sgxweb/home/company_disclosure/stockfacts?page=2&code=5TG&lang=en-us"
PJ = r'/xxx/MyPythonScripts/phantomjs_mac'
driver = webdriver.PhantomJS(PJ)
driver.get(url)
time.sleep(3)
html=driver.page_source.encode('utf-8')
print html
Upvotes: 1
Views: 1647
Reputation: 52675
Page content, as you've mentioned, is generated by JavaScript
code, so you won't be able to find it in initial page source and even adding time.sleep(3)
could be not enough... You need to wait some time until required data present on page. Try to use below code:
from selenium import webdriver as web
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url="http://www.sgx.com/wps/portal/sgxweb/home/company_disclosure/stockfacts?page=2&code=5TG&lang=en-us"
PJ = r'/xxx/MyPythonScripts/phantomjs_mac'
driver = webdriver.PhantomJS(PJ)
driver.get(url)
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,'//div[starts-with(@id, "mainns_")]/iframe')))
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//div[@class="data-point-container section-break"]/table')))
html = driver.page_source
assert "Total Revenue" in html
With this code you will wait up to 10 seconds (you can increase timeout if you need) until required table
element presence. If it not rendered within 10 seconds, you'll get TimeOutException
Upvotes: 1