DataWizard

Reputation: 51

How to Use Selenium Webdriver to download files via a list of URLs

I wrote code that uses Selenium WebDriver to download files from a list of URLs, but for some reason it doesn't download anything to my assigned directory. The code works perfectly fine when I download the files one by one, but it doesn't work when I use a for loop.

This is an example URL: https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf

Here is my code:

download_dir = '/Users/datawizard/files/'

for web in down_link:
    try:
        options = webdriver.ChromeOptions()
        options.add_argument('headless')
        options.add_experimental_option("prefs", {
          "download.default_directory": '/Users/clinton/GRA_2021/scraping_project/pdf/',
          "download.prompt_for_download": False,
          "download.directory_upgrade": True,
#           "safebrowsing.enabled": True,
          "plugins.always_open_pdf_externally": True
        })
        driver = webdriver.Chrome(chrome_options=options)

        driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
        params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
        command_result = driver.execute("send_command", params)
        
        driver.get(url)
        
    except:
        print(str(web)+"Link cannot be open")

I am wondering whether I did something wrong in the code, since it doesn't give me any error when I run it.

Upvotes: 2

Views: 9550

Answers (2)

Cong Yang

Reputation: 94

There is already a wonderful solution here that covers any scenario, and I couldn't agree more with Tihamer's comment: "This is probably the most beautiful hack I've seen in years!" It has a limit, though (I don't have enough reputation to comment there, so I am posting it here): when the byte array is too big, e.g. more than 30 MB, the web driver cannot handle that much data and crashes. My improved solution is to use FileSaver.js to save the binary data whenever the byte array is too large (FileSaver.js has a limit too, but it is as large as 2 GB). The script is below:

var script = document.createElement('script');
script.type = 'text/javascript';
//use relative URL to avoid mixed content error https://nedbatchelder.com/blog/200710/httphttps_transitions_and_relative_urls.html 
script.src = '//cdn.jsdelivr.net/g/filesaver.js';
script.onload = function() {
    console.log((new Date().toString()) + " : FileSaver Script is ready!");
};
document.head.appendChild(script);
var url = arguments[0];
var callback = arguments[arguments.length - 1];
var fileNameOfFileSaver = "FileSaver_Saved_File.bin";
var xhr = new XMLHttpRequest();
xhr.open('GET', url, true);
xhr.responseType = "arraybuffer";
xhr.onload = function() {
    var arrayBuffer = xhr.response;
    var byteArray = new Uint8Array(arrayBuffer);
    if (byteArray.length > 30 * 1024 * 1024) {
        console.log((new Date().toString()) + " : byteArray length greater than 30M, actually " + byteArray.length + " bytes!, will use FileSaver.js to save file as \"" + fileNameOfFileSaver + "\" instead of return byteArray directly");
        const blob = new Blob([byteArray], {
            type: "application/octet-stream"
        });
        saveAs(blob, fileNameOfFileSaver);
        //return additional info to the caller 
        byteArray = (new TextEncoder()).encode("ExecuteAsyncScript Successfully, File Saved to |" + fileNameOfFileSaver + "| " + byteArray.length + " bytes");
    } else {
        console.log((new Date().toString()) + " : byteArray length less   than 30M, actually " + byteArray.length + " bytes,  will return byteArray directly");
    }
    callback(byteArray);
};
xhr.send();

Then pass the script to WebDriver's ExecuteAsyncScript method, as shown here, whether you are using Python, Java, C#, or any other language. You get full access to the binary data, and because the request runs inside the browser session you don't need to handle cookies or authorization yourself, as you would with external download programs such as curl, wget, or aria2.
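For Python, the call could look roughly like the sketch below. It is not taken from the linked answer: `result_to_bytes` and `fetch_with_browser` are hypothetical helper names, and the handling of the return value is an assumption, since a `Uint8Array` may reach Python either as a list of integers or as a dict keyed by index, depending on the driver.

```python
from pathlib import Path


def result_to_bytes(result):
    """Convert the value handed to the async-script callback into bytes.

    Assumption: depending on the driver, a Uint8Array may arrive either as a
    list of integers or as a dict keyed by index ("0", "1", ...).
    """
    if isinstance(result, dict):
        # Order the values by their numeric index before joining them.
        return bytes(result[key] for key in sorted(result, key=int))
    return bytes(result)


def fetch_with_browser(driver, script, url, dest):
    """Run `script` in the current page and save what it returns to `dest`.

    `driver` is a Selenium WebDriver and `script` is the JavaScript above,
    which receives `url` as arguments[0]. Returns the number of bytes written.
    """
    data = result_to_bytes(driver.execute_async_script(script, url))
    Path(dest).write_bytes(data)
    return len(data)
```

Two things are worth checking before relying on this: the XMLHttpRequest is subject to the same-origin policy, so the driver should already be on a page from the same site as `url`, and large files may need a longer limit set with `driver.set_script_timeout(...)`. When the script falls back to FileSaver.js, the bytes written are the short status message, not the file itself.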

Upvotes: 0

Alin Stelian

Reputation: 897

You don't need Selenium to download files; you can download them easily with the requests library.

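import requests

# YOUR_DOWNLOAD_PATH is a placeholder: your download directory, ending in "/"
for web in down_link:
    # Build the file name from the first query parameter (documentId)
    fileName = YOUR_DOWNLOAD_PATH + web.split("=")[1].split("&")[0] + ".pdf"

    r = requests.get(web, stream=True)
    r.raise_for_status()  # fail on HTTP errors instead of saving an error page
    with open(fileName, 'wb') as f:
        # iter_content() defaults to 1-byte chunks, so ask for larger ones
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)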
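The `split("=")` expression only works while `documentId` is the first query parameter, and it gives the same name to every attachment of one document. A slightly more robust sketch parses the query string with the standard library. The function name `filename_from_url` is hypothetical, and the parameter names are taken from the example URL in the question.

```python
from urllib.parse import urlparse, parse_qs


def filename_from_url(url, default_ext="pdf"):
    """Build '<documentId>_<attachmentNumber>.<contentType>' from the query string.

    Missing parameters fall back to placeholder values.
    """
    query = parse_qs(urlparse(url).query)
    doc_id = query.get("documentId", ["download"])[0]
    attachment = query.get("attachmentNumber", ["1"])[0]
    ext = query.get("contentType", [default_ext])[0]
    return f"{doc_id}_{attachment}.{ext}"
```

For the example URL this should yield `WHD-2020-0007-1730_1.pdf`, so different attachments of the same document no longer collide.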

Updated Answer based on Selenium

import time
from selenium import webdriver

# replace the value below with your list of URLs
down_link = [
    'https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf',
    'https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf']
download_dir = "/Users/datawizard/files/"

options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True
})
driver = webdriver.Chrome(options=options)  # the chrome_options= keyword is deprecated


for web in down_link:
    driver.get(web)
    time.sleep(5)  # wait for the download to finish; a better approach is to check that the file exists

driver.quit()

If your files don't have unique file names, the above code will replace the existing file with the downloaded one.
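The comment next to `time.sleep(5)` suggests checking for the file instead of sleeping for a fixed time. One way to do that is to poll the download directory, as in the sketch below. `wait_for_new_file` is a hypothetical helper, and it assumes that Chrome marks in-progress downloads with a `.crdownload` suffix.

```python
import os
import time


def wait_for_new_file(directory, before, timeout=30.0, poll=0.5):
    """Wait until `directory` contains a finished file that is not in `before`.

    `before` is the set of file names present before the download started.
    Assumption: Chrome names partial downloads with a ".crdownload" suffix.
    Returns the sorted list of new file names, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = set(os.listdir(directory))
        partial = any(name.endswith(".crdownload") for name in names)
        new_files = names - set(before)
        if new_files and not partial:
            return sorted(new_files)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no finished download within {timeout} seconds")
        time.sleep(poll)
```

In the loop, this would replace the sleep: record `before = set(os.listdir(download_dir))`, call `driver.get(web)`, then call `wait_for_new_file(download_dir, before)`. It cannot detect a download that overwrites a file already in the directory, so it works best together with unique file names.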

Upvotes: 2
