Reputation: 161
I log in to a site and it gives me a browser cookie.
I go to a URL and it returns a JSON response.
How do I scrape the page after entering await page.goto('blahblahblah.json'); ?
Upvotes: 15
Views: 30330
Reputation: 56865
Puppeteer is a browser automation tool, so if you're just making a request to a raw JSON document, there's nothing to automate. Prefer using Node's native fetch, possibly with a non-robot user agent if you're running into a server block:
const url = "https://jsonplaceholder.typicode.com/users";

fetch(url)
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }
    return res.json();
  })
  .then(data => {
    console.log(data);
  })
  .catch(err => console.error(err));
This is faster and simpler than Puppeteer and has no dependency or install step.
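If the server rejects Node's default user agent, here's a minimal sketch of sending a browser-like one; the UA string is just an example, copy the one your real browser sends:

const url = "https://jsonplaceholder.typicode.com/users";

fetch(url, {
  headers: {
    // example browser-like UA string; replace with one from your own browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
  },
})
  .then(res => (res.ok ? res.json() : Promise.reject(Error(res.statusText))))
  .then(data => console.log(data))
  .catch(err => console.error(err));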
If you really need Puppeteer for some reason, possibly to bypass a block or include live credentials from a site, you can do it as follows:
const puppeteer = require("puppeteer"); // ^22.6.0
const url = "https://jsonplaceholder.typicode.com/users";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
const response = await page.goto(url, {waitUntil: "domcontentloaded"});
const data = await response.json();
console.log(data);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
This is better practice than existing answers: it closes the browser if an error is thrown, uses the response returned by goto to avoid having to use page.on or evaluate, and uses "domcontentloaded", which is the fastest navigation wait condition.
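The question also mentions a login cookie. Here's a rough sketch of attaching one with page.setCookie before navigating; the cookie name, value, domain, and URL are placeholders for whatever your site actually sets:

const puppeteer = require("puppeteer"); // ^22.6.0

const url = "https://example.com/data.json"; // placeholder URL

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();

  // placeholder cookie; copy the real name/value from your logged-in session
  await page.setCookie({
    name: "session_id",
    value: "your-session-cookie-value",
    domain: "example.com",
  });

  const response = await page.goto(url, {waitUntil: "domcontentloaded"});
  console.log(await response.json());
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());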
Speed comparison. Fetch:
real 0m0.474s
user 0m0.131s
sys 0m0.049s
Puppeteer:
real 0m0.662s
user 0m0.544s
sys 0m0.134s
Another trick to keep in mind about fetch: it works in the browser, which can be useful when you need to hit a protected API with same-origin credentials from the page, imitating a call from the client. In this case, you can run fetch from an evaluate block.
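A rough sketch of that pattern, assuming page is already on the logged-in site and /api/protected-endpoint stands in for whatever protected route you actually need:

// assumes `page` is a Puppeteer page already navigated to the logged-in site
const data = await page.evaluate(async () => {
  const res = await fetch("/api/protected-endpoint", {
    credentials: "same-origin", // send the session cookie along with the request
  });

  if (!res.ok) {
    throw Error(res.statusText);
  }

  return res.json();
});

console.log(data);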
Upvotes: 0
Reputation: 22424
Another way which doesn't give you intermittent issues is to evaluate the body once it becomes available and return it as JSON, e.g.
const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch({
    headless: false // change to true in prod!
  });
  const page = await browser.newPage();
  await page.goto('https://raw.githubusercontent.com/GoogleChrome/puppeteer/master/package.json');

  // I would leave this here as a fail safe
  await page.content();

  const innerText = await page.evaluate(() => {
    return JSON.parse(document.querySelector("body").innerText);
  });

  console.log("innerText now contains the JSON");
  console.log(innerText);

  // I will leave this as an exercise for you to
  // write out to FS...
  await browser.close();
}

run();
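For the write-to-FS part left as an exercise, a minimal sketch of what could go inside run() after the evaluate call; the output path is arbitrary:

const fs = require('fs');

// pretty-print the parsed JSON back out to disk; adjust the path as needed
fs.writeFileSync('./output.json', JSON.stringify(innerText, null, 2));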
Upvotes: 35
Reputation: 2654
You can intercept the network response, like this:
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  page.on('response', async response => {
    console.log('got response', response.url());
    const data = await response.buffer();
    fs.writeFileSync('/tmp/response.json', data);
  });

  await page.goto('https://raw.githubusercontent.com/GoogleChrome/puppeteer/master/package.json', {waitUntil: 'networkidle0'});
  await browser.close();
})();
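Note that the handler above fires for every response the page receives; if the page pulls in more than one resource, a variant of the handler that only keeps the JSON you care about might filter on the URL (the .json suffix check is just an illustration):

page.on('response', async response => {
  // only persist responses whose URL ends in .json
  if (!response.url().endsWith('.json')) return;

  const data = await response.buffer();
  fs.writeFileSync('/tmp/response.json', data);
});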
Upvotes: 3