Pink
Pink

Reputation: 43

Selenium download entire html

I have been trying to use selenium to scrape and entire web page. I expect at least a handful of them are spa's such as Angular, React, Vue so that is why I am using Selenium.

I need to download the entire page (if some content isn't loaded from lazy loading because of not scrolling down that is fine). I have tried setting a time.sleep() delay, but that has not worked. After I get the page I am looking to hash it and store it in a db to compare later and check to see if the content has changed. Currently the hash is different every time and that is because selenium is not downloading the entire page, each time a different partial amount is missing. I have confirmed this on several web pages not just a singular one.

I also have probably a 1000+ web pages to go through by hand just getting all the links so I do not have time to find an element on them to make sure it is loaded.

How long this process takes is not important. If it takes 1+ hours so be it, speed is not important only accuracy.

If you have an alternative idea please also share.

My driver declaration

 from selenium import webdriver
 from selenium.common.exceptions import WebDriverException

 driverPath = '/usr/lib/chromium-browser/chromedriver'

 def create_web_driver():
     options = webdriver.ChromeOptions()
     options.add_argument('headless')

     # set the window size
     options.add_argument('window-size=1200x600')

     # try to initalize the driver
     try:
         driver = webdriver.Chrome(executable_path=driverPath, chrome_options=options)
     except WebDriverException:
         print("failed to start driver at path: " + driverPath)

     return driver

My url call my timeout = 20

 driver.get(url)
 time.sleep(timeout)
 content = driver.page_source

 content = content.encode('utf-8')
 hashed_content = hashlib.sha512(content).hexdigest()

^ getting different hash here every time since same url not producing same web page

Upvotes: 4

Views: 3235

Answers (2)

undetected Selenium
undetected Selenium

Reputation: 193308

As the Application Under Test(AUT) is based on Angular, React, Vue in that case Selenium seems to be the perfect choice.

Now, as you are fine with the fact that some content isn't loaded from lazy loading because of not scrolling makes the usecase feasible. But in all possible ways ...do not have time to find an element on them to make sure it is loaded... can't be really compensated inducing time.sleep() as time.sleep() have certain drawbacks. You can find a detailed discussion in How to sleep webdriver in python for milliseconds. It would be worth to mention that the state of the HTML DOM will be different for all the 1000 odd web pages.

Solution

A couple of viable solutions:

If you implement pageLoadStrategy, page_source method will be triggered at the same tripping point and possibly you would see identical hashed_content.

Upvotes: 1

msaba92
msaba92

Reputation: 357

In my experience time.sleep() does not work well with dynamic loading times. If the page is javascript-heavy you have to use the WebDriverWait clause.

Something like this:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get(url)

element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "[my-attribute='my-value']")))

Change 10 with whatever timer you want, and By.CSS_SELECTOR and its value with whatever type you want to use as a reference for a lo

You can also wrap the WebDriverWait around a Try/Except statement with the TimeoutException exception, which you can get from the submodule selenium.common.exceptions in case you want to set a hard limit.

You can probably set it inside a while loop if you truly want it to check forever until the page's loaded, because I couldn't find any reference in the docs about waiting "forever", but you'll have to experiment with it.

Upvotes: 1

Related Questions