Reputation: 676
I am trying to use selenium and PhantomJS to scrape some of the elements produced by JavaScript.
My Code:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
from selenium import webdriver
from collections import OrderedDict
import time
driver = webdriver.PhantomJS()
driver.get('http://www.envirostor.dtsc.ca.gov/public/profile_report?global_id=01290021&starttab=landuserestrictions')
driver.find_element_by_id('sitefacdocsTab').click()
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
After the click action, I still get the old page data, not the new content that jQuery loads into the tab.
Upvotes: 1
Views: 139
Reputation: 7238
requests

Open Developer Tools > Network > XHR tab in your browser, then click on the Site/Facility Docs tab. You'll see an AJAX request appear in the XHR tab; it is sent to the site below to fetch the tab data. You can scrape anything you want from that tab simply by using the requests module.
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.envirostor.dtsc.ca.gov/public/profile_report_include?global_id=01290021&ou_id=&site_id=&tabname=sitefacdocs&orderby=&schorderby=&comporderby=&rand=0.07839738919075079&_=1521609095041')
soup = BeautifulSoup(r.text, 'lxml')
# And to check whether we've got the correct data:
table = soup.find('table', class_='display-v4-default')
print(table.find('a', target='_documents').text)
# Soil Management Plan Implementation Report, Public Market Infrastructure Relocation, Phase 1-B Infrastructure Area
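That long query string is easier to read and maintain if you pass the parameters as a dict and let requests encode them for you. A minimal sketch (parameter names are taken from the URL above; the request is only prepared here, not actually sent):

```python
import requests

BASE = 'http://www.envirostor.dtsc.ca.gov/public/profile_report_include'
params = {
    'global_id': '01290021',
    'tabname': 'sitefacdocs',
}

# Prepare the request without sending it, just to show the encoded URL.
prepared = requests.Request('GET', BASE, params=params).prepare()
print(prepared.url)
```

In a real run you would simply call `requests.get(BASE, params=params)` and feed `r.text` to BeautifulSoup as above.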
Selenium
When you want to wait for a page to load, you should never use time.sleep(). Use Explicit Waits instead. After that, you can get the whole tab content using the .get_attribute('innerHTML') method.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('http://www.envirostor.dtsc.ca.gov/public/profile_report?global_id=01290021&starttab=landuserestrictions')
driver.find_element_by_id('sitefacdocsTab').click()
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, 'docdatediv')))
html = driver.find_element_by_id('sitefacdocs').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table', class_='display-v4-default')
print(table.find('a', target='_documents').text)
# Soil Management Plan Implementation Report, Public Market Infrastructure Relocation, Phase 1-B Infrastructure Area
Additional info:

The element with id="docdatediv" is the div tag that contains the date range filter. I've used it because it is not present on the first tab, but is present on the tab you want. You can use any such element for the WebDriverWait.

And the element with id="sitefacdocs" is the div tag that contains the entire tab contents (i.e., the date filter and all the tables below it), so your soup object will have all of those things to scrape.
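Once the soup object holds that tab's HTML, collecting every document link is a one-liner with find_all. A sketch against a stand-in snippet (the markup below only mimics the table structure and link text; it is not the real page):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.find_element_by_id('sitefacdocs').get_attribute('innerHTML')
html = '''
<table class="display-v4-default">
  <tr><td><a target="_documents" href="/doc1.pdf">Soil Management Plan</a></td></tr>
  <tr><td><a target="_documents" href="/doc2.pdf">Site Inspection Report</a></td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
# Filter anchors by their target attribute, as in the answer above.
titles = [a.text for a in soup.find_all('a', target='_documents')]
print(titles)
# ['Soil Management Plan', 'Site Inspection Report']
```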
Upvotes: 1