Rick

Reputation: 141

Saving full page content using Selenium

I was wondering what's the best way to save all the files that are retrieved when Selenium visits a site. In other words, when Selenium visits http://www.google.com I want to save the HTML, JavaScript (including scripts referenced in src tags), images, and potentially content contained in iframes. How can this be done?

I know getHTMLSource() will return the HTML content in the body of the main frame, but how can this be extended to download the complete set of files necessary to render that page again? Thanks in advance!

Upvotes: 14

Views: 31233

Answers (5)

Ashark

Reputation: 843

I did this by downloading the external resources (images) and rewriting their src attributes.
Let's assume I want to save all images from <img> tags to the ../images path relative to the current page.

~/site
~/site/pages/
~/site/pages/page1.html
~/site/pages/page2.html
~/site/images/
~/site/images/img_for_page1.png
~/site/images/img_for_page2.png

I download the images with requests module.

# save_full_page.py

from selenium import webdriver
from selenium.webdriver.common.by import By
import requests

driver = webdriver.Chrome()
...  # open page you want to save

with open("replace_img_srcs.js", 'r') as file:
    replace_img_srcs_js = file.read()

save_dir = "/home/user/site"
save_to_file = "/home/user/site/pages/page1.html"

# Download every <img> to the local images directory
img_tags = driver.find_elements(By.TAG_NAME, "img")
for img_tag in img_tags:
    img_src = img_tag.get_attribute("src")
    r = requests.get(img_src, allow_redirects=True)
    img_filename = img_src.rsplit('/', 1)[1]
    with open(save_dir + "/images/" + img_filename, 'wb') as img_file:
        img_file.write(r.content)

# Rewrite the src attributes once, then save the modified page
driver.execute_script(replace_img_srcs_js)  # see below
with open(save_to_file, 'w') as f:
    f.write(driver.page_source)

This code edits the src attributes. I placed the script in a separate file to be able to see syntax highlighting. You can paste its contents directly into driver.execute_script(...) if you wish.

// replace_img_srcs.js

Array.prototype.slice.call(document.getElementsByTagName('img')).forEach(
  function (item) {
    var img_src = item.src;
    var img_filename = img_src.replace(/^.*[\\\/]/, '');
    // Images may be named with symbols that need URL encoding
    var img_filename_urlencoded = encodeURIComponent(img_filename);
    item.src = "../images/" + img_filename_urlencoded;
  }
);

Now we have the page saved for offline use.
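One subtlety worth noting (my observation, not part of the original answer): the Python loop saves images under their raw filenames, while the JavaScript rewrites each src to a URL-encoded name, so files containing spaces or special characters won't resolve after saving. A sketch of a hypothetical helper that mirrors the encoding on the Python side (urllib.parse.quote approximates JavaScript's encodeURIComponent, though the two differ on a few characters such as parentheses):

```python
from urllib.parse import quote

def local_image_name(img_src):
    """Return the filename the rewritten <img> src will point at,
    URL-encoded the same way replace_img_srcs.js encodes it."""
    img_filename = img_src.rsplit('/', 1)[1]
    return quote(img_filename, safe='')

print(local_image_name("https://example.com/images/img for page1.png"))
# img%20for%20page1.png
```

Saving the downloaded bytes under this name instead of the raw filename keeps the rewritten links and the files on disk consistent.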

Upvotes: 0

pAulseperformance

Reputation: 389

The only built-in method Selenium has for retrieving source content is

driver = webdriver.Chrome()
driver.get('http://www.someurl.com')
page_source = driver.page_source

But that doesn't download all the images, CSS, and JS scripts you would get if you used Ctrl+S on a webpage. So you'll need to emulate the Ctrl+S keystroke after you navigate to a webpage, as Algorithmatic has stated.

I made a gist to show how that's done: https://gist.github.com/GrilledChickenThighs/211c307edf8f828806c4bb4e4707b106

# Download entire webpage including all javascript, html, css of webpage.
# Replicates ctrl+s when on a webpage.

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

def save_current_page(browser):
    # 'browser' is the active WebDriver instance
    ActionChains(browser).send_keys(Keys.CONTROL, "s").perform()

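Note that Ctrl+S opens the browser's native save dialog, which Selenium itself cannot drive. As an alternative sketch (not from the original answer), Chrome exposes a DevTools Protocol command, Page.captureSnapshot, that serializes the current page together with its resources as a single MHTML file; the helper name and path below are my own:

```python
def save_page_as_mhtml(driver, path):
    """Capture the current page (with inlined resources) as MHTML via
    the Chrome DevTools Protocol. Chrome/Chromium drivers only."""
    result = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})
    # newline="" preserves the CRLF line endings MHTML requires
    with open(path, "w", newline="") as f:
        f.write(result["data"])
```

The resulting .mhtml file opens offline in Chrome without any separate asset files.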
Upvotes: 4

Algorithmatic

Reputation: 1892

If you really want to use Selenium, then what you can do is emulate Ctrl+S to save the page, but then it's more work (and OS dependent) to emulate pressing Enter or to change the location where the webpage and its content get saved.

I wanted to do the same thing with Selenium but realized that I could just use tools like wget, and that I didn't really need Selenium at all. So I ended up using wget; it's really powerful and it does exactly what I need.

This is how you would do it using wget from a Python script:

    import os
    # Save HTML
    directory = 'directory_to_save_webpage_content/'
    url = 'http://www.google.com'
    wget = "wget -p -k -P {} {}".format(directory, url)
    os.system(wget)

The args passed are just to make it possible to view the page offline as if you're still online.

--page-requisites           -p   -- get all images needed to display page
--convert-links             -k   -- convert links to be relative
--directory-prefix          -P   -- specify prefix to save files to
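If the URL or directory might contain shell metacharacters, a slightly safer variant (my own sketch, not part of the original answer) passes the arguments as a list through subprocess instead of building a shell string:

```python
import subprocess

def build_wget_cmd(directory, url):
    """Build the wget argument list: page requisites, converted links,
    and a directory prefix, mirroring the os.system call above."""
    return ["wget", "-p", "-k", "-P", directory, url]

# Run it (requires wget on PATH):
# subprocess.run(build_wget_cmd('directory_to_save_webpage_content/',
#                               'http://www.google.com'), check=True)
```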

Upvotes: 1

Sergio Cazzolato

Reputation: 462

A good tool for that is http://www.httrack.com/; Selenium doesn't provide any API for this. In case you need to save the full content of a page from your Selenium test case, perhaps you can execute HTTrack as a command-line tool.


Upvotes: 1

Dave Hunt

Reputation: 8223

Selenium isn't designed for this; you could either:

  1. Use getHtmlSource and parse the resulting HTML for references to external files, which you can then download and store outside of Selenium.
  2. Use something other than Selenium to download and store an offline version of a website - I'm sure there are plenty of tools that could do this if you search. For example, wget can perform a recursive download (http://en.wikipedia.org/wiki/Wget#Recursive_download).
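A minimal sketch of option 1 (my own illustration, assuming Python and only the standard library): feed the page source to an html.parser subclass that collects the URLs of external assets, which you can then download separately.

```python
from html.parser import HTMLParser

class AssetCollector(HTMLParser):
    """Collect URLs of external assets (images, scripts, iframes,
    stylesheets) referenced by a page's HTML source."""
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script", "iframe") and attrs.get("src"):
            self.assets.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.assets.append(attrs["href"])

collector = AssetCollector()
collector.feed('<img src="logo.png"><script src="app.js"></script>'
               '<link rel="stylesheet" href="style.css">')
print(collector.assets)  # ['logo.png', 'app.js', 'style.css']
```

Each collected URL can then be fetched with urllib.request or requests, outside of Selenium.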

Is there any reason you want to use Selenium? Is this part of your testing strategy or are you just wanting to find a tool that will create an offline copy of a page?

Upvotes: 7
