Reputation: 3766
I'm trying to save a webpage for offline usage with Node.js and Puppeteer. I see a lot of examples with:
await page.screenshot({path: 'example.png'});
But with a bigger webpage that's not an option. A better approach in Puppeteer is to load the page and then save its HTML:
const html = await page.content();
// ... write to file
OK, that works. Next I want to handle pages with infinite scrolling, like Twitter. So I decided to intercept all image requests in the Puppeteer page:
await page.setRequestInterception(true);
page.on('request', request => {
    if (request.resourceType() === 'image') {
        const imgUrl = request.url();
        download(imgUrl, 'download').then((output) => {
            images.push({url: output.url, filename: output.filename});
        }).catch((err) => {
            console.log(err);
        });
        request.abort();
    } else {
        request.continue();
    }
});
OK, I used the npm download lib to download all the images, and yes, the downloaded images are fine. :D
Now, when I save the content, I want the source to point at the offline images.
const html = await page.content();
But now I'd like to replace all references like:
<img src="/pic.png?id=123">
<img src="https://twitter.com/pics/1.png">
And also things like:
<div style="background-image: url('this_also.gif')"></div>
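A rough sketch of that rewrite step, done as plain string processing on the saved HTML (the URL-to-filename map is assumed to have been built while downloading; a real version would be safer with an HTML parser such as cheerio, since plain replacement also touches inline scripts and text):

```javascript
// Replace every occurrence of each remote URL with its local filename.
// `mapping` is a Map from the URL exactly as it appears in the HTML
// (src attributes, CSS url(...) values) to the local file path.
function rewriteHtml(html, mapping) {
  let out = html;
  for (const [remoteUrl, localPath] of mapping) {
    out = out.split(remoteUrl).join(localPath);
  }
  return out;
}
```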
So is there a way (in Puppeteer) to scrape a big page and store the whole content offline? JavaScript and CSS would also be nice.
Update
For now I will open the big HTML file again with Puppeteer, and then intercept all requests for files such as https://dom.com/img/img.jpg, /file.jpg, ....
request.respond({
    status: 200,
    contentType: 'image/jpeg',
    body: '..'
});
I could also do it with a Chrome extension, but I'd like to have a function with some options, like page.html(), the same as page.pdf().
Upvotes: 10
Views: 24915
Reputation: 3766
For now I will use:
https://github.com/dosyago/22120
The goal of this project:
This project literally makes your web browsing available COMPLETELY OFFLINE.
Your browser does not even know the difference. It's literally that amazing. Yes.
Upvotes: 0
Reputation: 111
Going back to your first approach: you can use the fullPage option to take a screenshot of the whole page.
await page.screenshot({path: 'example.png', fullPage: true});
If you really want to download all the resources for offline use, yes, you can:
const fse = require('fs-extra');

page.on('response', async (res) => {
    // save all the data to SOMEWHERE_TO_STORE
    await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});
Then you can browse the website offline through Puppeteer with everything working.
await page.setRequestInterception(true);

page.on('request', async (req) => {
    // handle the request by responding with the data that you stored in SOMEWHERE_TO_STORE
    // and of course, don't forget THE_FILE_TYPE
    req.respond({
        status: 200,
        contentType: THE_FILE_TYPE,
        body: await fse.readFile(SOMEWHERE_TO_STORE),
    });
});
Upvotes: 10