Scrape data from site with browser-based Template Engine

Question

I'm trying to scrape data from a page that is rendered in the browser by a JavaScript template engine. With jsdom I can't get any data; perhaps the page doesn't have enough time to load or render. How should I scrape data in this case: use a timer, or download the whole page with a request?

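var jsdom = require('jsdom');

jsdom.env({
  url: link, // URL of the page to scrape
  scripts: ["http://code.jquery.com/jquery.js"],
  done: function (errors, window) {
    // Report load errors instead of silently continuing
    if (errors) { return console.error(errors); }
    var $ = window.$;
    var date = $('.date').text();
    console.log(date);
  }
});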

Stilltorik · Accepted Answer

A colleague of mine has a PhantomJS-based project doing just that: https://github.com/vmeurisse/phantomCrawl.

He has a simple example that looks a lot like your snippet:

'use strict';

var PhantomCrawl = require('./src/PhantomCrawl');

var urls = [];

urls.push('http://www.bing.com');
var ptc = new PhantomCrawl({
    urls: urls,
    nbThreads: 4,
    crawlerPerThread: 4,
    maxDepth: 1
});

urls is the list of urls to crawl.

nbThreads is the number of instances of PhantomJS launched.

crawlerPerThread is the number of pages crawled in parallel per instance of PhantomJS.

maxDepth is how many levels of links the crawler follows, starting from each URL in the list.
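To make the depth limit concrete, here is a minimal sketch of a depth-limited crawl over a made-up, in-memory link graph. It is not phantomCrawl's code: the `links` table and the `crawl` function are hypothetical, and it assumes that a depth of 1 means "the start pages plus the pages they link to directly", which may not match the project's exact semantics.

```javascript
'use strict';

// Hypothetical in-memory "site": each page maps to the links found on it.
var links = {
    a: ['b', 'c'],
    b: ['d'],
    c: [],
    d: ['e'],
    e: []
};

// Breadth-first crawl that stops following links after maxDepth hops.
// Returns the pages in the order they were visited.
function crawl(startUrls, maxDepth) {
    var seen = Object.create(null);
    var order = [];
    var queue = startUrls.map(function (url) {
        return { url: url, depth: 0 };
    });

    while (queue.length > 0) {
        var item = queue.shift();
        if (seen[item.url]) { continue; }   // skip pages already visited
        seen[item.url] = true;
        order.push(item.url);

        // Pages at the depth limit are visited, but their links are ignored.
        if (item.depth >= maxDepth) { continue; }
        (links[item.url] || []).forEach(function (next) {
            queue.push({ url: next, depth: item.depth + 1 });
        });
    }
    return order;
}

console.log(crawl(['a'], 1)); // [ 'a', 'b', 'c' ]
```

Under that assumption, raising the limit to 2 would also reach `d`, while 0 would visit only the start page.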
