Jiri

Reputation: 197

How to wait for all downloads to complete with Puppeteer?

I have a small web scraping application that downloads multiple files from a web application where the URLs require visiting the page.

It works fine if I keep the browser instance alive between runs, but I want to close the instance between runs. When I call browser.close(), my downloads are stopped because the Chrome instance is closed before the downloads have finished.

Does puppeteer provide a way to check if downloads are still active, and wait for them to complete? I've tried page.waitForNavigation({ waitUntil: "networkidle0" }) and "networkidle2", but those seem to wait indefinitely.


Upvotes: 17

Views: 22465

Answers (11)

tristansokol

Reputation: 4271

Updating @B45i's answer for 2024. Verified on "puppeteer": "^21.1.1"


async function waitUntilDownload(session, fileName = '') {
  return new Promise((resolve, reject) => {
    session.on('Browser.downloadProgress', (e) => {
      if (e.state === 'completed') {
        resolve(fileName);
      } else if (e.state === 'canceled') {
        reject();
      }
    });
  });
}
//...
const session = await browser.target().createCDPSession();
await session.send('Browser.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: filePath,
  eventsEnabled: true, // required for Browser.downloadProgress events to be emitted
});
//...
await waitUntilDownload(session, filePath);

Upvotes: 3

B45i

Reputation: 2601

I didn't like solutions that were checking DOM or file system for the file.

From Chrome DevTools Protocol documentation I found two events, Page.downloadProgress and Browser.downloadProgress. (Though Page.downloadProgress is marked as deprecated, that's the one that worked for me.)

This event has a property called state, which tells you the state of the download. state can be inProgress, completed or canceled.

You can wrap this event in a Promise and await it until the state changes to completed:

async function waitUntilDownload(page, fileName = '') {
    return new Promise((resolve, reject) => {
        page._client().on('Page.downloadProgress', e => { // or 'Browser.downloadProgress'
            if (e.state === 'completed') {
                resolve(fileName);
            } else if (e.state === 'canceled') {
                reject();
            }
        });
    });
}

and await it as follows,

await waitUntilDownload(page, fileName);
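
Since the question is about waiting for all downloads, here is a minimal sketch of how the same events could be used to track several downloads at once. It assumes a CDP session on which Browser.setDownloadBehavior was sent with eventsEnabled: true, and it relies on the guid field that Browser.downloadWillBegin and Browser.downloadProgress carry. The names createDownloadTracker and waitForIdle are made up for illustration; they are not part of Puppeteer.

```javascript
// Sketch: track every download announced on a CDP session and wait until none are pending.
// `createDownloadTracker` and `waitForIdle` are hypothetical names, not Puppeteer APIs.
function createDownloadTracker(session) {
  const pending = new Set(); // guids of downloads still in progress
  const failed = [];         // guids of downloads that were canceled
  let waiters = [];          // resolvers waiting for `pending` to drain

  // Resolve all waiters once nothing is pending any more.
  const settle = () => {
    if (pending.size === 0) {
      waiters.forEach((resolve) => resolve({ failed: [...failed] }));
      waiters = [];
    }
  };

  session.on('Browser.downloadWillBegin', (e) => pending.add(e.guid));
  session.on('Browser.downloadProgress', (e) => {
    if (e.state === 'inProgress') {
      pending.add(e.guid);
      return;
    }
    if (e.state === 'canceled') failed.push(e.guid);
    pending.delete(e.guid); // state is 'completed' or 'canceled'
    settle();
  });

  return {
    // Resolves with { failed } once no tracked download is pending.
    waitForIdle() {
      return new Promise((resolve) => {
        waiters.push(resolve);
        settle();
      });
    },
  };
}
```

Create the tracker before triggering the downloads, then call await tracker.waitForIdle() before browser.close(). Note that waitForIdle() resolves immediately if no download has been announced yet, so call it only after the downloads have been started.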

Upvotes: 6

Milton Paucar

Reputation: 7

You could search the download location for files with the 'crdownload' extension, which Chrome appends while a file is still downloading. When the download completes, the file is renamed to its original extension: 'video_audio_file.mp4.crdownload' becomes 'video_audio_file.mp4'.

const fs = require('fs');
const path = require('path');
const myPath = path.resolve('/your/file/download/folder');

// Returns 1 while any file in myPath still ends in '.crdownload', otherwise 0.
function stillWorking(myPath) {
    const filenames = fs.readdirSync(myPath);
    return filenames.some(file => file.endsWith('.crdownload')) ? 1 : 0;
}

Then use it in an infinite loop that checks at a fixed interval. Here I check every 3 seconds; when it returns 0, there are no pending files left to download.

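// Must run inside an async function.
while (true) {
    // wait 3 seconds without blocking the event loop
    await new Promise(resolve => setTimeout(resolve, 3000));
    if (stillWorking(myPath) == 0) {
        await browser.close();
        break;
    }
}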

Upvotes: -1

TeaDrinker

Reputation: 475

Update:

It's 2022. Use Playwright to get away from this mess: it can manage downloads.

It also has 'smarter' locators, which re-examine selectors every time before click().


Old version for Puppeteer:

My solution is to use Chrome's own chrome://downloads/ page to manage downloaded files. With this approach it is also easy to automatically restart a failed download using Chrome's own retry feature.

This example is currently 'single-threaded', because it only monitors the first item that appears on the download manager page. You can easily adapt it to any number of parallel downloads by iterating through all download items (#frb0 to #frbn) on that page. Just take care of your network :)

const dmPage = await browser.newPage()
await dmPage.goto('chrome://downloads/')

await your_download_button.click() // start download

await dmPage.bringToFront() // this is necessary
await dmPage.waitForFunction(
    () => {
        // monitor the state of the first download item:
        // return true once it has finished; click retry if it failed
        const dm = document.querySelector('downloads-manager').shadowRoot
        const firstItem = dm.querySelector('#frb0')
        if (firstItem) {
            const thatArea = firstItem.shadowRoot.querySelector('.controls')
            const atag = thatArea.querySelector('a')
            // '在文件夹中显示' is the Chinese-locale link text ("Show in folder");
            // adjust it for your locale, or match on ids/classes instead
            if (atag && atag.textContent === '在文件夹中显示') {
                return true
            }
            const btn = thatArea.querySelector('cr-button')
            // '重试' is the Chinese-locale button text ("Retry")
            if (btn && btn.textContent === '重试') {
                btn.click()
            }
        }
    },
    { polling: 'raf', timeout: 0 }, // poll on every animation frame; 'mutation' polling is also available
)
console.log('finished')

Upvotes: 4

Amitabh

Reputation: 172

You can use node-watch to report updates to the target directory. When the file download is complete you will receive an update event with the name of the new file that has been downloaded.

Run npm to install node-watch:

npm install node-watch

Sample code:

const puppeteer = require('puppeteer');
const watch = require('node-watch');
const path = require('path');

const watchDir = '/Users/home/Downloads';
const filepath = path.join(watchDir, 'download_file');

(async () => {
    const browser = await puppeteer.launch();
    // Add code to initiate the download ...

    // Close the browser once the expected file shows up in the watched directory.
    watch(watchDir, function (event, name) {
        if (event == 'update' && name === filepath) {
            browser.close(); // use case specific
            process.exit();  // use case specific
        }
    });
})();

Upvotes: 1

Zero14

Reputation: 31

Here is another function; it just waits for the pause button to disappear:

async function waitForDownload(browser: Browser) {
  const dmPage = await browser.newPage();
  await dmPage.goto("chrome://downloads/");

  await dmPage.bringToFront();
  await dmPage.waitForFunction(() => {
    try {
      const donePath = document.querySelector("downloads-manager")!.shadowRoot!
        .querySelector(
          "#frb0",
        )!.shadowRoot!.querySelector("#pauseOrResume")!;
      if ((donePath as HTMLButtonElement).innerText != "Pause") {
        return true;
      }
    } catch {
      //
    }
  }, { timeout: 0 });
  console.log("Download finished");
}

Upvotes: 2

Delorean

Reputation: 364

I created a simple awaitable function that polls for the file every 200 ms and times out after 10 seconds.

import fs from "fs";

awaitFileDownloaded: async (filePath) => {
    let timeout = 10000
    const delay = 200

    return new Promise(async (resolve, reject) => {
        while (timeout > 0) {
            if (fs.existsSync(filePath)) {
                resolve(true);
                return
            } else {
                // HelperUI.delay is the author's own helper (not shown); presumably it resolves after `delay` ms
                await HelperUI.delay(delay)
                timeout -= delay
            }
        }
        reject("awaitFileDownloaded timed out")
    });
},

Upvotes: 0

Sardorbek Muhtorov

Reputation: 83

You need to check the response to the request.

page.on('response', (response) => { console.log(response.status(), response.url()); });

Inspect what comes back in the response and look at its status; a successful one comes with status 200.
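
As a rough sketch of what such a check could look like, the helper below inspects a Puppeteer response for status 200 and a Content-Disposition header marking it as an attachment. The name looksLikeDownload is hypothetical, and the attachment test is an assumption about how the target site serves its files. Keep in mind that a 200 response only indicates that the download has started, not that the file has been fully written to disk.

```javascript
// Sketch: decide whether a Puppeteer HTTPResponse looks like a file download.
// `looksLikeDownload` is a hypothetical helper name, not a Puppeteer API.
function looksLikeDownload(response) {
  const headers = response.headers(); // Puppeteer lower-cases header names
  const disposition = headers['content-disposition'] || '';
  return response.status() === 200 && /attachment/i.test(disposition);
}

// Usage inside a Puppeteer script (assumes `page` already exists):
// page.on('response', (response) => {
//   if (looksLikeDownload(response)) {
//     console.log('download started:', response.url());
//   }
// });
```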

Upvotes: 2

Anand Biradar

Reputation: 79

Using puppeteer and chrome I have one more solution which might help you.

When you download a file with Chrome, it always has the ".crdownload" extension while the download is in progress. Once the file is completely downloaded, that extension vanishes.

So I use a recursive function with a maximum number of iterations. If the file has not finished downloading within that limit, I delete it. Meanwhile I keep checking the folder for that extension.

async checkFileDownloaded(path, timer) {
    return new Promise(async (resolve, reject) => {
        let noOfFile;
        try {
            noOfFile = fs.readdirSync(path); // synchronous call, no await needed
        } catch (err) {
            return resolve("null");
        }
        for (let i in noOfFile) {
            if (noOfFile[i].includes('.crdownload')) {
                await this.delay(20000);
                if (timer == 0) {
                    fs.unlink(path + '/' + noOfFile[i], (err) => {
                    });
                    return resolve("Success");
                } else {
                    timer = timer - 1;
                    await this.checkFileDownloaded(path, timer);
                }
            }
        }
        return resolve("Success");
    });
}
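
For comparison, here is a minimal iterative sketch of the same idea: poll the folder until no '.crdownload' file is left, or give up after a timeout. The function name, option names and default values are illustrative assumptions. Unlike the version above, it does not delete unfinished files; it only reports whether the folder became clean in time.

```javascript
const fs = require('fs');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: resolves true once `dir` contains no '.crdownload' files,
// or false if `timeoutMs` elapses first. Names and defaults are illustrative.
async function waitForNoPartialFiles(dir, { timeoutMs = 60000, intervalMs = 500 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const names = await fs.promises.readdir(dir);
    if (!names.some((name) => name.endsWith('.crdownload'))) return true;
    if (Date.now() >= deadline) return false;
    await sleep(intervalMs);
  }
}
```

A caller could then write `if (await waitForNoPartialFiles(downloadDir)) await browser.close();`, where downloadDir is whatever folder the browser downloads into. If the check runs before Chrome has created the partial file, it returns true right away, so it is best called only after the download has visibly started.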

Upvotes: 1

Gustave Dupre

Reputation: 119

An alternative if you have the file name or a suggestion for other ways to check.


const fs = require('fs');

// Polls every 3 seconds until the file exists (there is no timeout).
async function waitFile(filename) {
    while (!fs.existsSync(filename)) {
        await delay(3000);
    }
}

function delay(time) {
    return new Promise(function(resolve) { 
        setTimeout(resolve, time)
    });
}

Implementation:

var filename = `${yyyy}${mm}_TAC.csv`;
var pathWithFilename = `${config.path}\\${filename}`;
await waitFile(pathWithFilename);

Upvotes: 2

Hellonearthis

Reputation: 1762

Try an await page.waitFor(50000); with a delay as long as the download should take.

Or look into watching for file changes to detect when the file transfer is complete.

Upvotes: -1
