leaksterrr

Reputation: 4166

Node Cheerio to scrape an entire site

Has anybody used Node Cheerio to scrape an entire site and not just the home / first page the scraper gets pointed to?

At the minute I'm doing the following which only scrapes the target page.

var request = require('request');
var cheerio = require('cheerio');

request('http://arandomsite.com/', function (error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);
        ...
        ...
        ...
    }
});

Upvotes: 0

Views: 1687

Answers (1)

major-mann

Reputation: 2652

I have never used Cheerio, but I would assume that, as with many other scrapers, it only processes the page you point it to. Assuming cheerio.load returns a jQuery-like API, you would probably have to do something like this:

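$('a').each(function (index, a) {
    // TODO: You may want to keep track here of which you have done, and not redo any.
    // The callback receives a raw element, so wrap it with $() before calling .attr().
    var href = $(a).attr('href');
    if (href) {
        request('http://arandomsite.com' + href, myPageProcessFunction);
    }
});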

Obviously you would need to add things like iframes as well to make sure you get a complete result.

To clarify, here is some updated code:

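request('http://arandomsite.com/', function responseFunction(error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);
        // No visited-URL tracking here, so pages that link to each other are fetched repeatedly.
        $('a').each(function (index, a) {
            // Wrap the raw element with $() to read its href attribute.
            var href = $(a).attr('href');
            if (href) {
                request('http://arandomsite.com' + href, responseFunction);
            }
        });
    }
});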
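
The recursion above has no memory of which pages it has already requested, and it only builds a valid URL when the href starts with "/". Here is a rough sketch of one way to handle both, using only Node's built-in URL class. The names collectNewLinks, crawl and fetchLinks are hypothetical helpers made up for this example, not part of request or cheerio. fetchLinks is assumed to be a function you write yourself that downloads a page and calls back with the href values found on it.

```javascript
// Hypothetical helper: turn the raw href values found on one page into
// absolute, same-site URLs that have not been queued yet.
function collectNewLinks(pageUrl, hrefs, visited) {
    var origin = new URL(pageUrl).origin;
    var fresh = [];
    hrefs.forEach(function (href) {
        var target;
        try {
            target = new URL(href, pageUrl); // resolves relative and absolute hrefs
        } catch (e) {
            return; // skip malformed hrefs
        }
        target.hash = ''; // "#section" links point at the same page
        if (target.origin !== origin) return; // stay on the starting site
        if (visited.has(target.href)) return; // already queued or fetched
        visited.add(target.href);
        fresh.push(target.href);
    });
    return fresh;
}

// Crawl from startUrl, one page at a time. fetchLinks(url, callback) is an
// assumed function that calls back with (error, hrefs). With request and
// cheerio it would download the page and collect $(a).attr('href') for
// every anchor. onDone receives the list of pages that were fetched.
function crawl(startUrl, fetchLinks, onDone) {
    var first = new URL(startUrl).href;
    var visited = new Set([first]);
    var queue = [first];
    var pages = [];
    (function next() {
        if (queue.length === 0) return onDone(pages);
        var url = queue.shift();
        fetchLinks(url, function (error, hrefs) {
            if (!error) {
                pages.push(url);
                queue = queue.concat(collectNewLinks(url, hrefs, visited));
            }
            next();
        });
    })();
}
```

Fetching one page at a time keeps the load on the target site low. If you want more speed you could start several requests in parallel, but the visited set would still be what stops the crawl from looping.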

Upvotes: 1
