knocked loose

Reputation: 3314

Way to scrape a JS-Rendered page?

I'm currently scraping a list of URLs on my site using the request-promise npm module.

This works well for what I need, however, I'm noticing that not all of my divs are appearing because some are rendered after the fact with JS. I know I can't run that JS code remotely to force the render, but is there any way to scrape the pages only after those elements have been added?

I'm doing this currently with Node, and would prefer to keep using Node if possible.

Here is what I have:

const request = require('request-promise');
const { JSDOM } = require('jsdom');

const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3'];

urls.forEach(url => {
  request(url)
    .then(function (html) {
      // get dummy dom
      const d_dom = new JSDOM(html);
      // ...
    });
});

Any thoughts on how to accomplish this? Or is there currently an alternative to Selenium available as an npm module?

Upvotes: 1

Views: 1408

Answers (1)

Get Off My Lawn

Reputation: 36351

You will want to use puppeteer, a Node library that drives headless Chrome (maintained by the Chrome team at Google), to load and parse dynamic web pages.

Use page.goto() to navigate to a specific page, then use page.content() to get the HTML content of the rendered page.

Here is an example of how to use it:

const { JSDOM } = require("jsdom");
const puppeteer = require('puppeteer')

const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3']

urls.forEach(async url => {
  let dom = new JSDOM(await makeRequest(url))
  console.log(dom.window.document.title)
});

async function makeRequest(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  let html = await page.content()

  await browser.close();
  return html
}
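If some of your divs are injected a little while after load, page.content() can still run too early. As a rough sketch (the waitUntil option and the '.js-rendered' selector here are assumptions for illustration, not part of the answer above), you can tell Puppeteer to wait for the network to go idle, or for a specific element, before grabbing the HTML:

// A variation of makeRequest() that waits for late-rendered content.
// '.js-rendered' is a hypothetical selector; replace it with a div that
// only appears after your page's JS has run.
async function makeRequestWhenRendered(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait until network activity has settled, so most JS-driven requests finish
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Or wait explicitly for an element that only exists after the JS render
  await page.waitForSelector('.js-rendered');

  const html = await page.content();
  await browser.close();
  return html;
}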

Upvotes: 3
