Reputation: 2828
I'm trying to scrape data from a page that is templated in the browser with a lot of JavaScript. When I try jsdom, I can't get any data; maybe the page doesn't have enough time to load or render. How should I scrape data in this case: use a timer, or download the whole page with a request?
var jsdom = require('jsdom');

jsdom.env({
  url: link, // URL of the page to scrape
  scripts: ["http://code.jquery.com/jquery.js"],
  done: function (errors, window) {
    var $ = window.$;
    var date = $('.date').text();
    console.log(date);
  }
});
Upvotes: 1
Views: 345
Reputation: 1692
A colleague of mine has a PhantomJS-based project doing just that: https://github.com/vmeurisse/phantomCrawl.
He has a simple example that looks a lot like your snippet:
'use strict';

var PhantomCrawl = require('./src/PhantomCrawl');

var urls = [];
urls.push('http://www.bing.com');

var ptc = new PhantomCrawl({
  urls: urls,
  nbThreads: 4,
  crawlerPerThread: 4,
  maxDepth: 1
});
urls is the list of URLs to crawl.
nbThreads is the number of PhantomJS instances launched.
crawlerPerThread is the number of pages crawled in parallel per PhantomJS instance.
maxDepth is how many levels deep the crawler follows links found on a crawled page.
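To make the depth limit concrete, here is a minimal sketch of a depth-bounded breadth-first crawl over an in-memory link graph. It is not phantomCrawl's actual implementation: the `links` object, `getLinks`, and `crawl` are hypothetical names made up for illustration, and I'm assuming depth 0 means the starting URLs themselves.

```javascript
'use strict';

// Hypothetical link graph standing in for real pages (page -> links on it).
var links = {
  a: ['b', 'c'],
  b: ['d'],
  c: ['a'],
  d: ['e'],
  e: []
};

// Hypothetical helper: returns the links found on a page.
function getLinks(url) {
  return links[url] || [];
}

// Breadth-first crawl that stops following links once maxDepth is reached.
// Start URLs are depth 0; pages linked from them are depth 1, and so on.
function crawl(startUrls, maxDepth, getLinks) {
  var seen = {};
  var visited = [];
  var queue = startUrls.map(function (url) {
    return { url: url, depth: 0 };
  });

  while (queue.length > 0) {
    var item = queue.shift();
    if (seen[item.url]) {
      continue; // already crawled
    }
    seen[item.url] = true;
    visited.push(item.url);

    if (item.depth >= maxDepth) {
      continue; // depth limit reached: do not follow this page's links
    }
    getLinks(item.url).forEach(function (next) {
      if (!seen[next]) {
        queue.push({ url: next, depth: item.depth + 1 });
      }
    });
  }
  return visited;
}

console.log(crawl(['a'], 1, getLinks));
```

Under that assumption, maxDepth: 1 crawls the starting page plus the pages it links to directly, and nothing further.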
Upvotes: 3