Reputation: 197
I have a small web scraping application that downloads multiple files from a web application where the URLs require visting the page.
It works fine if I keep the browser instance alive in between runs, but I want to close the instance in between runs. When I call browser.close()
my downloads are stopped because the chrome instance is closed before the downloads have finished.
Does puppeteer provide a way to check if downloads are still active, and wait for them to complete? I've tried page.waitForNavigation({ waitUntil: "networkidle0" })
and "networkidle2"
, but those seem to wait indefinitely.
Upvotes: 17
Views: 22465
Reputation: 4271
Updating @B45i's answer for 2024. Verified on "puppeteer": "^21.1.1"
async function waitUntilDownload(session, fileName = '') {
return new Promise((resolve, reject) => {
session.on('Browser.downloadProgress', (e) => {
if (e.state === 'completed') {
resolve(fileName);
} else if (e.state === 'canceled') {
reject();
}
});
});
}
//...
const session = await browser.target().createCDPSession();
await session.send('Browser.setDownloadBehavior', {
behavior: 'allow',
downloadPath: filePath,
eventsEnabled: true,
});
//...
await waitUntilDownload(session, filepath);
Upvotes: 3
Reputation: 2601
I didn't like solutions that were checking DOM or file system for the file.
From Chrome DevTools Protocol documentation I found two events,
Page.downloadProgress
and Browser.downloadProgress
. (Though Page.downloadProgress
is marked as deprecated, that's the one that worked for me.)
This event has a property called state
which tells you about the state of the download. state
could be inProgress
, completed
and canceled
.
You can wrap this event in a Promise to await it till the status changes to completed
async function waitUntilDownload(page, fileName = '') {
return new Promise((resolve, reject) => {
page._client().on('Page.downloadProgress', e => { // or 'Browser.downloadProgress'
if (e.state === 'completed') {
resolve(fileName);
} else if (e.state === 'canceled') {
reject();
}
});
});
}
and await it as follows,
await waitUntilDownload(page, fileName);
Upvotes: 6
Reputation: 7
you could search in the download location for the extension the files have when still downloading 'crdownload' and when the download is completed the file is renamed with the original extension: from this 'video_audio_file.mp4.crdownload' turns into 'video_audio_file.mp4' without the 'crdownload' at the end
const fs = require('fs');
const myPath = path.resolve('/your/file/download/folder');
let siNo = 0;
function stillWorking(myPath) {
siNo = 0;
filenames = fs.readdirSync(myPath);
filenames.forEach(file => {
if (file.includes('crdownload')) {
siNo = 1;
}
});
return siNo;
}
Then you use is in an infinite loop like this and check very a certain period of time, here I check every 3 seconds and when it returns 0 which means there is no pending files to be fully downloaded.
while (true) {
execSync('sleep 3');
if (stillWorking(myPath) == 0) {
await browser.close();
break;
}
}
Upvotes: -1
Reputation: 475
Update:
It's 2022. Use Playwright to get away from this mass. manage downloads
It also has 'smarter' locator, which examine selectors every time before click()
old version for puppeteer:
My solution is to use chrome's own chrome://downloads/
page to managing download files. This solution can be very easily to auto restart a failed download using chrome's own feature
This example is 'single thread' currently, because it's only monitoring the first item appear in the download manager page. But you can easily adapt it to 'infinite threads' by iterating through all download items (#frb0
~#frbn
) in that page, well, take care of your network:)
dmPage = await browser.newPage()
await dmPage.goto('chrome://downloads/')
await your_download_button.click() // start download
await dmPage.bringToFront() // this is necessary
await dmPage.waitForFunction(
() => {
// monitoring the state of the first download item
// if finish than return true; if fail click
const dm = document.querySelector('downloads-manager').shadowRoot
const firstItem = dm.querySelector('#frb0')
if (firstItem) {
const thatArea = firstItem.shadowRoot.querySelector('.controls')
const atag = thatArea.querySelector('a')
if (atag && atag.textContent === '在文件夹中显示') { // may be 'show in file explorer...'? you can try some ids, classess and do a better job than me lol
return true
}
const btn = thatArea.querySelector('cr-button')
if (btn && btn.textContent === '重试') { // may be 'try again'
btn.click()
}
}
},
{ polling: 'raf', timeout: 0 }, // polling? yes. there is a 'polling: "mutation"' which kind of async
)
console.log('finish')
Upvotes: 4
Reputation: 172
You can use node-watch to report the updates to the target directory. When the file upload is complete you will receive an update event with the name of the new file that has been downloaded.
Run npm to install node-watch:
npm install node-watch
Sample code:
const puppeteer = require('puppeteer');
const watch = require('node-watch');
const path = require('path');
// Add code to initiate the download ...
const watchDir = '/Users/home/Downloads'
const filepath = path.join(watchDir, "download_file");
(async() => {
watch(watchDir, function(event, name) {
if (event == "update") {
if (name === filepath)) {
browser.close(); // use case specific
process.exit(); // use case specific
}
}
})
Upvotes: 1
Reputation: 31
Here is another function, its just wait for the pause button to disappear:
async function waitForDownload(browser: Browser) {
const dmPage = await browser.newPage();
await dmPage.goto("chrome://downloads/");
await dmPage.bringToFront();
await dmPage.waitForFunction(() => {
try {
const donePath = document.querySelector("downloads-manager")!.shadowRoot!
.querySelector(
"#frb0",
)!.shadowRoot!.querySelector("#pauseOrResume")!;
if ((donePath as HTMLButtonElement).innerText != "Pause") {
return true;
}
} catch {
//
}
}, { timeout: 0 });
console.log("Download finished");
}
Upvotes: 2
Reputation: 364
Created simple await function that will check for file rapidly or timeout in 10 seconds
import fs from "fs";
awaitFileDownloaded: async (filePath) => {
let timeout = 10000
const delay = 200
return new Promise(async (resolve, reject) => {
while (timeout > 0) {
if (fs.existsSync(filePath)) {
resolve(true);
return
} else {
await HelperUI.delay(delay)
timeout -= delay
}
}
reject("awaitFileDownloaded timed out")
});
},
Upvotes: 0
Reputation: 83
You need check request response.
await page.on('response', (response)=>{ console.log(response, response._url)}
You should check what is coming from response then find status, it comes with status 200
Upvotes: 2
Reputation: 79
Using puppeteer and chrome I have one more solution which might help you.
If you are downloading the file from chrome it will always have ".crdownload" extension. And when file is completely downloaded that extension will vanish.
So, I am using recurring function and maximum number of times it can iterate, If it doesn't download the file in that time.. I am deleting it. And I am constantly checking a folder for that extention.
async checkFileDownloaded(path, timer) {
return new Promise(async (resolve, reject) => {
let noOfFile;
try {
noOfFile = await fs.readdirSync(path);
} catch (err) {
return resolve("null");
}
for (let i in noOfFile) {
if (noOfFile[i].includes('.crdownload')) {
await this.delay(20000);
if (timer == 0) {
fs.unlink(path + '/' + noOfFile[i], (err) => {
});
return resolve("Success");
} else {
timer = timer - 1;
await this.checkFileDownloaded(path, timer);
}
}
}
return resolve("Success");
});
}
Upvotes: 1
Reputation: 119
An alternative if you have the file name or a suggestion for other ways to check.
async function waitFile (filename) {
return new Promise(async (resolve, reject) => {
if (!fs.existsSync(filename)) {
await delay(3000);
await waitFile(filename);
resolve();
}else{
resolve();
}
})
}
function delay(time) {
return new Promise(function(resolve) {
setTimeout(resolve, time)
});
}
Implementation:
var filename = `${yyyy}${mm}_TAC.csv`;
var pathWithFilename = `${config.path}\\${filename}`;
await waitFile(pathWithFilename);
Upvotes: 2
Reputation: 1762
Tried doing an await page.waitFor(50000);
with a time as long as the download should take.
Or look at watching for file changes on complete file transfer
Upvotes: -1