Johan Hoeksma

Reputation: 3766

Puppeteer, save webpage and images

I'm trying to save a webpage for offline use with Node.js and Puppeteer. I see a lot of examples that use:

await page.screenshot({path: 'example.png'});

But for a larger webpage that is not an option. A better option in Puppeteer is to load the page and then save its HTML:

const html = await page.content();
// ... write to file

OK, that works. Next I want to scroll through long pages such as Twitter. So I decided to block all images in the Puppeteer page and download them separately:

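// request.abort() and request.continue() only work once interception is enabled
await page.setRequestInterception(true);
page.on('request', request => {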
    if (request.resourceType() === 'image') {
        const imgUrl = request.url()
        download(imgUrl, 'download').then((output) => {
            images.push({url: output.url, filename: output.filename})
        }).catch((err) => {
            console.log(err)
        })
        request.abort()
    } else {
        request.continue()
    }
})

OK, I used the npm 'download' library to download all the images, and the downloaded images are fine :D.

Now, when I save the content, I want the source to point to the offline images.

const html = await page.content();

But now I would like to replace all the image references, such as:

<img src="/pic.png?id=123"> 
<img src="https://twitter.com/pics/1.png">

And also things like:

<div style="background-image: url('this_also.gif')"></div>
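The kind of replacement described above could be sketched as a plain string rewrite over the saved HTML. This is only a minimal sketch, not a Puppeteer feature: `rewriteImageUrls` is a hypothetical helper, `pageUrl` is assumed to be the URL the page was loaded from, and `images` is assumed to have the same `{url, filename}` shape as the array filled in the request handler. It does not decode HTML entities (for example `&amp;` or `&quot;` inside attribute values), so a real DOM parser would likely be more robust.

```javascript
// Hypothetical helper: point image references in saved HTML at local files.
// `images` is assumed to look like [{ url: 'https://...', filename: 'local.png' }, ...]
function rewriteImageUrls(html, pageUrl, images) {
    const byUrl = new Map(images.map((img) => [img.url, img.filename]));

    // Resolve a possibly relative reference against the page URL and
    // return the local filename if that image was downloaded.
    const toLocal = (ref) => {
        try {
            const absolute = new URL(ref, pageUrl).href;
            return byUrl.get(absolute) || ref;
        } catch (err) {
            return ref; // not a parsable URL, leave it untouched
        }
    };

    return html
        // <img ... src="..."> with single or double quotes
        .replace(/(<img\b[^>]*?\bsrc=)(["'])(.*?)\2/gi,
            (match, prefix, quote, ref) => prefix + quote + toLocal(ref) + quote)
        // url(...) in inline styles or <style> blocks
        .replace(/url\(\s*(["']?)([^"')]+)\1\s*\)/gi,
            (match, quote, ref) => 'url(' + quote + toLocal(ref) + quote + ')');
}
```

It would be called on the string returned by `page.content()`, for example `rewriteImageUrls(html, page.url(), images)`. References that were never downloaded are left as they are.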

So is there a way (in Puppeteer) to scrape a big page and store the whole content offline?

Saving the JavaScript and CSS as well would be nice.

Update

For now I will open the big HTML file again with Puppeteer.

Then I will intercept the requests for all files, such as https://dom.com/img/img.jpg, /file.jpg, ..., and respond with the saved content:

request.respond({
    status: 200,
    contentType: 'image/jpeg',
    body: '..'
});

I could also do it with a Chrome extension, but I would like to have a function with some options, such as page.html(), analogous to page.pdf().

Upvotes: 10

Views: 24915

Answers (2)

Johan Hoeksma

Reputation: 3766

For now I will use:

https://github.com/dosyago/22120

The goal of this project:

This project literally makes your web browsing available COMPLETELY OFFLINE. 
Your browser does not even know the difference. It's literally that amazing. Yes.

Upvotes: 0

ayiis

Reputation: 111

Going back to your first approach: you can use the fullPage option to take a screenshot of the whole page.

await page.screenshot({path: 'example.png', fullPage: true});

If you really want to download all resources for offline use, you can:

const fse = require('fs-extra');

// the callback must be async because it uses await
page.on('response', async (res) => {
    // save all the data to SOMEWHERE_TO_STORE
    await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});

Then you can browse the website offline through Puppeteer, with every request answered from the stored files.

await page.setRequestInterception(true);
page.on('request', async (req) => {
    // handle the request by responding data that you stored in SOMEWHERE_TO_STORE
    // and of course, don't forget THE_FILE_TYPE
    req.respond({
        status: 200,
        contentType: THE_FILE_TYPE,
        body: await fse.readFile(SOMEWHERE_TO_STORE),
    });
});
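SOMEWHERE_TO_STORE and THE_FILE_TYPE are placeholders above. One possible way to fill them in, as a rough sketch only: hash each URL into a file name so that slashes and query strings never end up in a path, and keep the content type in a small JSON file next to the body. `storagePathFor` and the `offline` directory are hypothetical names, not part of Puppeteer or fs-extra.

```javascript
const crypto = require('crypto');
const path = require('path');

// Hypothetical helper: derive a stable file path for a URL by hashing it.
// The same URL always maps to the same path, so the 'request' handler can
// find what the 'response' handler stored.
function storagePathFor(url, baseDir) {
    const hash = crypto.createHash('sha256').update(url).digest('hex');
    return path.join(baseDir, hash);
}

// Possible use in the handlers above (assumes fs-extra is loaded as `fse`):
//
//   // 'response' handler: store the body and its content type
//   const file = storagePathFor(res.url(), 'offline');
//   await fse.outputFile(file, await res.buffer());
//   await fse.outputJson(file + '.json', { contentType: res.headers()['content-type'] });
//
//   // 'request' handler: look both up again
//   const file = storagePathFor(req.url(), 'offline');
//   const { contentType } = await fse.readJson(file + '.json');
//   req.respond({ status: 200, contentType, body: await fse.readFile(file) });
```

Requests that were never stored would make `readJson` reject, so the 'request' handler would probably need a fallback such as `req.abort()` for those.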

Upvotes: 10
