Reputation: 4166
Has anybody used Node Cheerio to scrape an entire site and not just the home / first page the scraper gets pointed to?
At the moment I'm doing the following, which only scrapes the target page:
request('http://arandomsite.com/', function (error, response, html) {
if (!error && response.statusCode == 200){
var $ = cheerio.load(html);
...
...
...
}
});
Upvotes: 0
Views: 1687
Reputation: 2652
I have never used Cheerio, but I would assume that, as with many other scrapers, it will only process the page you point it to. Assuming cheerio.load returns a jQuery-like API, you would probably have to do something like this:
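$('a').each(function(index, a) {
//TODO: You may want to keep track here of which you have done, and not redo any.
// `a` is a raw element here, so wrap it with $() before calling .attr().
// The concatenation assumes root-relative hrefs such as '/about'.
request('http://arandomsite.com' + $(a).attr('href'), myPageProcessFunction);
});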
You would also need to follow other elements that point at pages, such as iframes, to make sure you get a complete result.
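One thing the snippet above glosses over is that href values are not always root-relative paths: they can be absolute, relative to the current page, point to another site, or be things like mailto: links. A minimal sketch of one way to normalize them, using Node's built-in URL class, might look like the following. The function name `sameSiteLinks` is made up for illustration and is not part of Cheerio or request.

```javascript
// Hypothetical helper (not part of cheerio or request): resolve raw href
// values against the URL of the page they came from, keep only http(s)
// links on the same host, strip #fragments, and drop duplicates.
function sameSiteLinks(pageUrl, hrefs) {
  const base = new URL(pageUrl);
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue; // <a> tags without an href attribute
    let target;
    try {
      target = new URL(href, base); // handles absolute and relative hrefs
    } catch (err) {
      continue; // skip hrefs that cannot be parsed
    }
    if (target.protocol !== 'http:' && target.protocol !== 'https:') continue;
    if (target.host !== base.host) continue; // stay on the same site
    target.hash = ''; // '/about' and '/about#team' are the same page
    seen.add(target.href);
  }
  return Array.from(seen);
}
```

The array it returns could then be fed to request one URL at a time instead of concatenating strings.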
In order to clarify, here is some updated code:
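request('http://arandomsite.com/', function responseFunction(error, response, html) {
if (!error && response.statusCode == 200){
var $ = cheerio.load(html);
$('a').each(function(index, a) {
// Wrap the raw element with $() before calling .attr().
// Note: pages already requested are not tracked here, so circular links will be re-fetched.
request('http://arandomsite.com' + $(a).attr('href'), responseFunction);
});
}
});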
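Because most sites link back to pages you have already seen, a recursive crawl like this needs to remember what it has visited, otherwise it will keep requesting the same pages. Here is a rough sketch of that idea in callback style. The names `crawl`, `fetchPage`, `extractHrefs` and `onDone` are my own placeholders: `fetchPage(url, callback)` is assumed to call back with `(error, html)`, and `extractHrefs(html)` is assumed to return an array of raw href strings.

```javascript
// Sketch of a same-site crawl with a visited set. fetchPage and extractHrefs
// are hypothetical caller-supplied functions, so this block does not depend
// on any particular HTTP or HTML-parsing library.
function crawl(startUrl, fetchPage, extractHrefs, onDone) {
  const start = new URL(startUrl);
  const visited = new Set();
  let pending = 0; // number of requests still in flight

  function visit(url) {
    if (visited.has(url)) return; // never request the same page twice
    visited.add(url);
    pending++;
    fetchPage(url, function (error, html) {
      if (!error) {
        for (const href of extractHrefs(html)) {
          let next;
          try {
            next = new URL(href, url); // resolve relative links
          } catch (err) {
            continue; // ignore hrefs that cannot be parsed
          }
          next.hash = '';
          if (next.host === start.host) visit(next.href);
        }
      }
      pending--;
      if (pending === 0) onDone(Array.from(visited));
    });
  }

  visit(start.href);
}
```

With request and Cheerio, `fetchPage` could be a thin wrapper around `request(url, ...)` that passes the body through when the status code is 200, and `extractHrefs` could load the HTML with `cheerio.load` and collect `$(a).attr('href')` for each anchor. For a large site you would probably also want to cap the number of pages or limit how many requests run at once.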
Upvotes: 1