khex

Reputation: 2828

Scrape data from site with browser-based Template Engine

I'm trying to scrape data from a page that is rendered in the browser by a JavaScript-heavy template engine. With jsdom I can't get any data, maybe because the page doesn't have enough time to load or render. How should I scrape data in this case: wait with a timer, or download the entire page with a request?

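var jsdom = require('jsdom');

// `link` holds the URL of the page to scrape.
jsdom.env({
  url: link,
  scripts: ["http://code.jquery.com/jquery.js"],
  done: function (errors, window) {
    if (errors) {
      return console.error(errors);
    }
    var $ = window.$;
    var date = $('.date').text();
    console.log(date);
  }
});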

Upvotes: 1

Views: 345

Answers (1)

Stilltorik

Reputation: 1692

A colleague of mine has a PhantomJS-based project doing just that: https://github.com/vmeurisse/phantomCrawl.

He has a simple example that looks a lot like your snippet:

'use strict';

var PhantomCrawl = require('./src/PhantomCrawl');

var urls = [];

urls.push('http://www.bing.com');
var ptc = new PhantomCrawl({
    urls: urls,
    nbThreads: 4,
    crawlerPerThread: 4,
    maxDepth: 1
});

urls is the list of URLs to crawl.

nbThreads is the number of instances of PhantomJS launched.

crawlerPerThread is the number of pages crawled in parallel per instance of PhantomJS.

maxDepth is how many levels of links the crawler follows, starting from each page in urls.
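To make the maxDepth idea concrete, here is a minimal sketch of a depth-limited crawl in plain JavaScript. It is not phantomCrawl's code: the crawl function, the in-memory links graph, and the .example URLs are all hypothetical, and it assumes that a depth of 0 means "visit only the starting pages". No real pages are loaded.

```javascript
'use strict';

// Hypothetical in-memory link graph standing in for real pages:
// each key is a URL, each value lists the links found on that page.
var links = {
  'http://a.example': ['http://b.example', 'http://c.example'],
  'http://b.example': ['http://d.example'],
  'http://c.example': [],
  'http://d.example': ['http://e.example'],
  'http://e.example': []
};

// Breadth-first crawl that stops following links after maxDepth hops.
function crawl(startUrls, maxDepth) {
  var visited = {};
  var order = [];
  var queue = startUrls.map(function (url) {
    return { url: url, depth: 0 };
  });

  while (queue.length > 0) {
    var item = queue.shift();
    if (visited[item.url]) { continue; }
    visited[item.url] = true;
    order.push(item.url);

    // Pages at the depth limit are visited, but their links are not followed.
    if (item.depth >= maxDepth) { continue; }

    (links[item.url] || []).forEach(function (next) {
      queue.push({ url: next, depth: item.depth + 1 });
    });
  }
  return order;
}

console.log(crawl(['http://a.example'], 1));
```

Under that assumption, maxDepth: 1 visits the starting page plus the pages it links to directly, and nothing beyond them.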

Upvotes: 3
