Reputation: 676
I am trying to use selenium and PhantomJS to scrape some of the elements produced by JavaScript.
My Code:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
from selenium import webdriver
from collections import OrderedDict
import time
driver = webdriver.PhantomJS()
driver.get('http://www.envirostor.dtsc.ca.gov/public/profile_report?global_id=01290021&starttab=landuserestrictions')
driver.find_element_by_id('sitefacdocsTab').click()
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
After the click action, I still get the old page data, not the new content that jQuery loads into the tab.
Upvotes: 1
Views: 139
Reputation: 7238
requests

Open Developer Tools > Network > XHR tab in your browser, then click on the Site/Facility Docs tab. You'll see an AJAX request appear in the XHR tab; it is sent to the site below to fetch the tab data. You can scrape anything you want from that tab simply by using the requests module.
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.envirostor.dtsc.ca.gov/public/profile_report_include?global_id=01290021&ou_id=&site_id=&tabname=sitefacdocs&orderby=&schorderby=&comporderby=&rand=0.07839738919075079&_=1521609095041')
soup = BeautifulSoup(r.text, 'lxml')
# And to check whether we've got the correct data:
table = soup.find('table', class_='display-v4-default')
print(table.find('a', target='_documents').text)
# Soil Management Plan Implementation Report, Public Market Infrastructure Relocation, Phase 1-B Infrastructure Area
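That long query string is easier to read and maintain if you pass the parameters as a dict and let requests encode them for you. A minimal sketch (parameter names are taken from the URL above; the request is only prepared here, not actually sent):

```python
import requests

BASE = 'http://www.envirostor.dtsc.ca.gov/public/profile_report_include'
params = {
    'global_id': '01290021',
    'tabname': 'sitefacdocs',
}

# Prepare the request without sending it, just to show the encoded URL.
prepared = requests.Request('GET', BASE, params=params).prepare()
print(prepared.url)
```

In a real run you would simply call `requests.get(BASE, params=params)` and feed `r.text` to BeautifulSoup as above.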
Selenium
When you want to wait for a page to load, you should never use time.sleep(). Use Explicit Waits instead. After that, you can get the whole tab content using the .get_attribute('innerHTML') method.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('http://www.envirostor.dtsc.ca.gov/public/profile_report?global_id=01290021&starttab=landuserestrictions')
driver.find_element_by_id('sitefacdocsTab').click()
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, 'docdatediv')))
html = driver.find_element_by_id('sitefacdocs').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table', class_='display-v4-default')
print(table.find('a', target='_documents').text)
# Soil Management Plan Implementation Report, Public Market Infrastructure Relocation, Phase 1-B Infrastructure Area
Additional info:

The element with id="docdatediv" is the div tag that contains the date range filter. I've used it because it is not present on the first tab, but is present on the tab you want. You can use any such element for the WebDriverWait.

And the element with id="sitefacdocs" is the div tag that contains the entire tab contents (i.e., the date filter and all the tables below it), so your soup object will have all of those things to scrape.
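Once the soup object holds that tab's HTML, collecting every document link is a one-liner with find_all. A sketch against a stand-in snippet (the markup below only mimics the table structure and link text; it is not the real page):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.find_element_by_id('sitefacdocs').get_attribute('innerHTML')
html = '''
<table class="display-v4-default">
  <tr><td><a target="_documents" href="/doc1.pdf">Soil Management Plan</a></td></tr>
  <tr><td><a target="_documents" href="/doc2.pdf">Site Inspection Report</a></td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
# Filter anchors by their target attribute, as in the answer above.
titles = [a.text for a in soup.find_all('a', target='_documents')]
print(titles)
# ['Soil Management Plan', 'Site Inspection Report']
```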
Upvotes: 1