Reputation: 71
Really need help from this community!
I am scraping dynamic content in Python using Selenium and Beautiful Soup. The problem is that the pricing data table cannot be parsed, even with the following code:
html=browser.execute_script('return document.body.innerHTML')
sel_soup=BeautifulSoup(html, 'html.parser')
However, I later found that if I click the 'View All Prices' button on the page before running the code above, I can parse that data table.
My question is: how can I parse and access those hidden dynamic td tags without using Selenium to click every 'View All Prices' button? There are too many of them.
The URL of the website I am scraping is https://www.cruisecritic.com/cruiseto/cruiseitineraries.cfm?port=122, and the attached picture shows the HTML of the dynamic data table I need.
Really appreciate the help from this community!
Upvotes: 3
Views: 2868
Reputation: 17511
You should target the element after it has loaded and pass it to the script as arguments[0], rather than grabbing the entire page via document:
html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
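Note that "hidden" rows are usually still present in the DOM (just hidden by CSS), so once you pull the element's innerHTML, BeautifulSoup sees them. A minimal sketch with a static HTML string (hypothetical markup, standing in for the page's pricing table):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one visible row and one row hidden via CSS,
# mimicking the collapsed pricing rows behind 'View All Prices'.
html_of_interest = """
<table>
  <tr><td>Interior</td><td>$499</td></tr>
  <tr style="display:none"><td>Balcony</td><td>$899</td></tr>
</table>
"""

sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
# find_all sees every td, hidden or not - CSS visibility does not
# remove nodes from the parsed tree.
prices = [td.get_text() for td in sel_soup.find_all('td')]
print(prices)  # ['Interior', '$499', 'Balcony', '$899']
```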
This covers two practical cases:
1. The element is not yet loaded in the DOM and you need to wait for it:
browser.get("url")
sleep(experimental) # usually get will finish only after the page is loaded but sometimes there is some JS woo running after on load time
try:
element= WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'your_id_of_interest')))
print "element is ready do the thing!"
html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
except TimeoutException:
print "Somethings wrong!"
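Under the hood, WebDriverWait simply polls a condition until it returns something truthy or the delay expires. A browser-free stand-in for that polling loop (the helper and toy condition here are hypothetical, for illustration only):

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    """Poll `condition` until it returns a truthy value, or raise
    TimeoutError - the same loop WebDriverWait runs internally."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Toy condition: "appears" on the third poll, like an element
# that a script inserts shortly after page load.
calls = {'n': 0}
def element_present():
    calls['n'] += 1
    return 'element' if calls['n'] >= 3 else None

print(wait_until(element_present))  # element
```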
2. The element is inside a shadow root, and you need to expand the shadow root first. This is probably not your situation, but I mention it here since it is relevant for future reference. Example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root

driver.get("chrome://settings")
root1 = driver.find_element(By.TAG_NAME, 'settings-ui')
html_of_interest = driver.execute_script('return arguments[0].innerHTML', root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
print(sel_soup)  # empty - the shadow root has not been expanded yet

shadow_root1 = expand_shadow_element(root1)
html_of_interest = driver.execute_script('return arguments[0].innerHTML', shadow_root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
print(sel_soup)  # now contains the content inside the shadow root
Upvotes: 4