Ray

Reputation: 71

Selenium Web Scraping With Beautiful Soup on Dynamic Content and Hidden Data Table

Really need help from this community!

I am scraping dynamic content in Python with Selenium and Beautiful Soup. The problem is that the pricing data table does not show up in the parsed HTML, even when I use the following code:

html=browser.execute_script('return document.body.innerHTML')
sel_soup=BeautifulSoup(html, 'html.parser')  

However, I later found that if I click the 'View All Prices' button on the web page before running the code above, the data table does get parsed in Python.

My question is: how can I parse and access those hidden dynamic td tags in Python without using Selenium to click every 'View All Prices' button? There are a great many of them.

The URL of the website I am scraping is https://www.cruisecritic.com/cruiseto/cruiseitineraries.cfm?port=122, and the attached picture shows the HTML of the dynamic data table that I need.

Really appreciate the help from this community!

Upvotes: 3

Views: 2868

Answers (1)

Eduard Florinescu

Reputation: 17511

You should target the element after it has loaded and pass it to the script as arguments[0], rather than taking the entire page via document:

html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
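
Once the fragment is parsed, the cells still have to be pulled out of the soup. Here is a minimal sketch of that step, assuming the fragment contains ordinary tr/td markup. The helper name table_rows and the sample HTML are made up for illustration and are not taken from the site in the question:

```python
from bs4 import BeautifulSoup


def table_rows(html_fragment):
    """Return each <tr> in the fragment as a list of stripped cell texts."""
    soup = BeautifulSoup(html_fragment, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:  # skip rows that contain no cells at all
            rows.append(cells)
    return rows


# Hypothetical fragment standing in for the innerHTML of a pricing table.
sample = """
<table>
  <tr><th>Sail date</th><th>Price</th></tr>
  <tr><td>Jan 5</td><td>$499</td></tr>
</table>
"""
print(table_rows(sample))
```

In practice you would pass html_of_interest instead of sample. If the real table nests other tables inside its cells, the row lookup would need to be narrowed.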

This has 2 practical cases:

1. The element is not yet loaded in the DOM and you need to wait for the element:

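from time import sleep
from bs4 import BeautifulSoup
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# browser is your existing WebDriver; experimental and delay are numbers of seconds you pick
browser.get("url")
sleep(experimental)  # get() usually returns once the page has loaded, but sometimes JS keeps running after the load event

try:
    # wait up to delay seconds for the element to appear in the DOM
    element = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'your_id_of_interest')))
    print("element is ready, do the thing!")
    html_of_interest = browser.execute_script('return arguments[0].innerHTML', element)
    sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
except TimeoutException:
    print("Something's wrong!")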

2. The element is in a shadow root and you first need to expand the shadow root. This is probably not your situation, but I mention it here since it is relevant for future reference. For example:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()


def expand_shadow_element(element):
    # Ask the browser for the shadow root attached to this host element
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root

driver.get("chrome://settings")
# find_element_by_tag_name was removed in recent Selenium 4 releases; use find_element(By.TAG_NAME, ...)
root1 = driver.find_element(By.TAG_NAME, 'settings-ui')

html_of_interest = driver.execute_script('return arguments[0].innerHTML', root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # empty: the shadow root is not expanded

shadow_root1 = expand_shadow_element(root1)

html_of_interest = driver.execute_script('return arguments[0].innerHTML', shadow_root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup
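
The two steps can also be folded into a single script call, so the shadow root object never has to be passed back and forth between Python and the browser. This is a rough sketch, not the only way to do it. The helper name shadow_soup is made up, and it assumes an open shadow root, because a closed one is reported as null by the browser:

```python
from bs4 import BeautifulSoup

# JavaScript run in the page: returns the shadow tree's markup, or null
# (None in Python) when the host element exposes no shadow root.
SHADOW_HTML_JS = (
    "return arguments[0].shadowRoot"
    " ? arguments[0].shadowRoot.innerHTML : null"
)


def shadow_soup(driver, host):
    """Parse the shadow tree of host, or return None if there is none."""
    html = driver.execute_script(SHADOW_HTML_JS, host)
    if html is None:
        return None
    return BeautifulSoup(html, "html.parser")


# Usage with the objects from the example above (needs a live browser):
# sel_soup = shadow_soup(driver, root1)
```

Nested shadow roots would still need one call per level, since innerHTML does not include the shadow trees of child elements.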


Upvotes: 4
