Amy Coin

Reputation: 161

How to scrape JSON from puppeteer?

I log in to a site and it gives me a browser cookie.

I go to a URL and it returns a JSON response.

How do I scrape the page after entering await page.goto('blahblahblah.json');?

Upvotes: 15

Views: 30330

Answers (3)

ggorlen

Reputation: 56865

Puppeteer is a browser automation tool, so if you're just making a request to a raw JSON document, there's nothing to automate. Prefer using Node's native fetch, possibly with a non-robot user agent if you're running into a server block:

const url = "https://jsonplaceholder.typicode.com/users";

fetch(url)
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.json();
  })
  .then(data => {
    console.log(data);
  })
  .catch(err => console.error(err));

This is faster and simpler than Puppeteer and has no dependency or install step.
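If a server blocks the default Node user agent, as mentioned above, you can pass a browser-like User-Agent header to fetch. A minimal sketch; the header value here is just an example string, not something from the original question:

const url = "https://jsonplaceholder.typicode.com/users";

// Example only: a browser-like user agent for servers that reject the default one
fetch(url, {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
  },
})
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.json();
  })
  .then(data => console.log(data))
  .catch(err => console.error(err));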

If you really need Puppeteer for some reason, possibly to bypass a block or include live credentials from a site, you can do it as follows:

const puppeteer = require("puppeteer"); // ^22.6.0

const url = "https://jsonplaceholder.typicode.com/users";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const response = await page.goto(url, {waitUntil: "domcontentloaded"});
  const data = await response.json();
  console.log(data);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

This is better practice than the existing answers: it closes the browser if an error is thrown, uses the response returned by goto so there's no need for page.on or evaluate, and uses "domcontentloaded", the fastest navigation wait condition.

Speed comparison. Fetch:

real 0m0.474s
user 0m0.131s
sys  0m0.049s

Puppeteer:

real 0m0.662s
user 0m0.544s
sys  0m0.134s

Another trick to keep in mind about fetch: it also works in the browser, which can be useful when you need to hit a protected API with same-origin credentials from the page, imitating a call from the client. In this case, you can run fetch from an evaluate block, as sketched below.
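A rough sketch of that last point, reusing the page from the launch example above; the /api/data endpoint is a made-up placeholder for a same-origin API:

// Sketch only: fetch runs inside the page, so the request carries the
// page's own cookies. "/api/data" is a hypothetical same-origin endpoint.
const data = await page.evaluate(async () => {
  const res = await fetch("/api/data", {credentials: "include"});

  if (!res.ok) {
    throw Error(res.statusText);
  }

  return res.json();
});
console.log(data);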

Upvotes: 0

Rippo

Reputation: 22424

Another way, which doesn't give you intermittent issues, is to evaluate the body once it becomes available and parse it as JSON, e.g.

const puppeteer = require('puppeteer');

async function run() {
    const browser = await puppeteer.launch({
        headless: false // change to true in prod!
    });

    const page = await browser.newPage();

    await page.goto('https://raw.githubusercontent.com/GoogleChrome/puppeteer/master/package.json');

    // I would leave this here as a fail safe
    await page.content();

    const innerText = await page.evaluate(() => {
        return JSON.parse(document.querySelector("body").innerText);
    });

    console.log("innerText now contains the JSON");
    console.log(innerText);

    // I will leave this as an exercise for you to
    // write out to FS...

    await browser.close();
}

run();
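For the file-system step the last comment leaves as an exercise, a minimal sketch could look like this; output.json is just a placeholder name, not part of the original answer:

const fs = require('fs'); // at the top of the file

// inside run(), after the console.log calls:
fs.writeFileSync('output.json', JSON.stringify(innerText, null, 2)); // placeholder filename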

Upvotes: 35

Pasi

Reputation: 2654

You can intercept the network response, like this:

const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  page.on('response', async response => {
    console.log('got response', response.url()) // url() is the public accessor; _url is internal
    const data = await response.buffer()
    fs.writeFileSync('/tmp/response.json', data)
  })
  await page.goto('https://raw.githubusercontent.com/GoogleChrome/puppeteer/master/package.json', {waitUntil: 'networkidle0'})
  await browser.close()
})()
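Note that the response handler above fires for every response the page makes. On a page with more traffic you would typically filter, for example by URL; a sketch, using the same page and fs as above:

page.on('response', async response => {
  // Only persist the JSON document we navigated to (example filter)
  if (response.url().endsWith('package.json')) {
    fs.writeFileSync('/tmp/response.json', await response.buffer())
  }
})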

Upvotes: 3
