Reputation: 51
I wrote code that uses Selenium WebDriver to download files from a list of URLs, but for some reason it doesn't download anything to my assigned directory. The code works perfectly fine when I download the files one by one, but when I use a for loop, it doesn't work.
This is an example URL: https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf
Here is my code:
download_dir = '/Users/datawizard/files/'

for web in down_link:
    try:
        options = webdriver.ChromeOptions()
        options.add_argument('headless')
        options.add_experimental_option("prefs", {
            "download.default_directory": '/Users/clinton/GRA_2021/scraping_project/pdf/',
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            # "safebrowsing.enabled": True,
            "plugins.always_open_pdf_externally": True
        })
        driver = webdriver.Chrome(chrome_options=options)
        driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
        params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
        command_result = driver.execute("send_command", params)
        driver.get(url)
    except:
        print(str(web) + "Link cannot be open")
I am wondering whether I did something wrong in the code, since it doesn't give me any error when I run it.
Upvotes: 2
Views: 9550
Reputation: 94
There is already a wonderful solution here that covers almost any scenario, and I couldn't agree more with Tihamer's comment: "This is probably the most beautiful hack I've seen in years!" But it has a limit (I don't have enough reputation to comment there, so I'm posting here instead): when the byte array is too big, e.g. exceeds 30 MB, the web driver cannot handle that amount of data and crashes. So here is an improved solution: when the byte array is too large, use FileSaver.js to save the binary data instead (it has a limit too, but it is as large as 2 GB). The script is below:
var script = document.createElement('script');
script.type = 'text/javascript';
// use a relative URL to avoid mixed content errors: https://nedbatchelder.com/blog/200710/httphttps_transitions_and_relative_urls.html
script.src = '//cdn.jsdelivr.net/g/filesaver.js';
script.onload = function() {
    console.log((new Date().toString()) + " : FileSaver Script is ready!");
};
document.head.appendChild(script);

var url = arguments[0];
var callback = arguments[arguments.length - 1];
var fileNameOfFileSaver = "FileSaver_Saved_File.bin";

var xhr = new XMLHttpRequest();
xhr.open('GET', url, true);
xhr.responseType = "arraybuffer";
xhr.onload = function() {
    var arrayBuffer = xhr.response;
    var byteArray = new Uint8Array(arrayBuffer);
    if (byteArray.length > 30 * 1024 * 1024) {
        console.log((new Date().toString()) + " : byteArray length greater than 30M, actually " + byteArray.length + " bytes!, will use FileSaver.js to save file as \"" + fileNameOfFileSaver + "\" instead of return byteArray directly");
        const blob = new Blob([byteArray], {
            type: "application/octet-stream"
        });
        saveAs(blob, fileNameOfFileSaver);
        // return additional info to the caller
        byteArray = (new TextEncoder()).encode("ExecuteAsyncScript Successfully, File Saved to |" + fileNameOfFileSaver + "| " + byteArray.length + " bytes");
    } else {
        console.log((new Date().toString()) + " : byteArray length less than 30M, actually " + byteArray.length + " bytes, will return byteArray directly");
    }
    callback(byteArray);
};
xhr.send();
Then just pass this script to the WebDriver's ExecuteAsyncScript method, as shown here, no matter whether you are using Python, Java, C#, or any other language. You get full access to the binary data, and there is no need to worry about cookies or authorization, as you would when using external download programs such as curl, wget, or aria2.
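For example, here is a minimal Python sketch of the call, assuming the JavaScript above is stored in the string js_script and chromedriver is on your PATH; the output file name attachment.pdf is only illustrative:

# Minimal sketch: hand the JavaScript above to execute_async_script.
# Assumptions: js_script holds the script as a string, chromedriver is on PATH.
from selenium import webdriver

driver = webdriver.Chrome()
driver.set_script_timeout(120)  # large downloads can take a while
driver.get("https://www.regulations.gov/")  # open the site first so the XHR runs in that origin

url = "https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf"
result = driver.execute_async_script(js_script, url)

# Selenium usually returns the byte array as a list of integers; convert and write it.
with open("attachment.pdf", "wb") as f:
    f.write(bytes(result))

driver.quit()

If the response exceeds the 30 MB threshold, the script instead saves the file via FileSaver.js to Chrome's download directory and returns a short status message rather than the raw bytes.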
Upvotes: 0
Reputation: 897
You don't need Selenium to download files; you can download them easily using the requests library:
import requests

for web in down_link:
    fileName = YOUR_DOWNLOAD_PATH + web.split("=")[1].split("&")[0] + ".pdf"  # I created a filename
    r = requests.get(web, stream=True)
    with open(fileName, 'wb') as f:
        for chunk in r.iter_content():
            f.write(chunk)
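If some URLs fail silently, you can make the loop above a bit more defensive by checking the HTTP status and streaming in larger chunks; a small sketch of that variation (the User-Agent value is only an example, in case the server rejects bare scripted requests):

import requests

for web in down_link:
    fileName = YOUR_DOWNLOAD_PATH + web.split("=")[1].split("&")[0] + ".pdf"
    r = requests.get(web, stream=True, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()  # stop on 4xx/5xx instead of writing an error page to disk
    with open(fileName, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)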
Updated Answer based on Selenium
import time
from selenium import webdriver

# replace the below value with your list of URLs
down_link = [
    'https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf',
    'https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf']
download_dir = "/Users/datawizard/files/"

options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True
})
driver = webdriver.Chrome(chrome_options=options)

for web in down_link:
    driver.get(web)
    time.sleep(5)  # wait for the download to finish; better handling would be to check whether the file exists

driver.quit()
If your files don't have unique file names, the above code will replace the existing file with the downloaded one.
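Instead of a fixed time.sleep(5), you can poll the download directory until Chrome's temporary .crdownload file disappears. A minimal sketch, where wait_for_downloads is a hypothetical helper and the timeout is just an example:

import glob
import os
import time

def wait_for_downloads(directory, timeout=60):
    """Block until no Chrome .crdownload temp files remain in the directory."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not glob.glob(os.path.join(directory, "*.crdownload")):
            return True
        time.sleep(0.5)
    return False

for web in down_link:
    driver.get(web)
    time.sleep(1)  # give Chrome a moment to start the download
    wait_for_downloads(download_dir)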
Upvotes: 2