kaytrance

Reputation: 2757

Synchronous URL fetch in node.js?

Is there a way to get the page source from a specified URL synchronously? The problem is that I have a long list of URLs (around 1000 of them) to fetch and parse, and doing it in a loop with a callback is quite painful, because it starts all the fetchUrl calls simultaneously and parses each one in its callback.

Preferably I would like to be able to:

  1. Get url1
  2. Parse url1 source
  3. Save parsing results to HDD
  4. Get url2
  5. Parse url2 source
  6. Save parsing results to HDD
  7. ...repeat for the whole list.

Currently I use the fetch package to get the URL source and cheerio for parsing.
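
Simplified, my current loop looks something like this (the parsing/saving part is just a placeholder), and all the fetchUrl calls fire at once:

var fetchUrl = require('fetch').fetchUrl,
    cheerio = require('cheerio');

urls.forEach(function (url) {
  // every fetchUrl call starts immediately, so all requests run in parallel
  fetchUrl(url, function (error, meta, body) {
    var $ = cheerio.load(body.toString());
    // ... parse with $ and save the results to disk here ...
  });
});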

Upvotes: 0

Views: 364

Answers (3)

Todd Yandell

Reputation: 14696

Sync I/O and Node don’t mix. If you really want to do this sync, you’re not gaining anything by using Node—it’s not even really possible. You could use Ruby instead.

The other answers are the correct way to do this on a production server. You should be submitting the requests to some kind of queue which can limit concurrency so you aren’t trying to make 1000 connections all at once. I like batch for this.
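
A rough sketch of that with batch might look something like this (fetchAndParse is just a placeholder for your fetch/parse/save step):

var Batch = require('batch');

var batch = new Batch();
batch.concurrency(10); // at most 10 requests in flight at a time

urls.forEach(function (url) {
  batch.push(function (done) {
    // fetch the page, parse it with cheerio, save to disk, then call done
    fetchAndParse(url, done);
  });
});

batch.end(function (err) {
  console.log('all done');
});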

If this isn’t for production and you can use an unstable version of Node, you can get the sync-style syntax using co, which uses generators to pause execution in the middle of a function via the yield keyword:

var co = require('co'),
    request = require('co-request'),
    cheerio = require('cheerio');

var urls = [];
for (var i = 0; i < 10; i++)
  urls.push('http://en.wikipedia.org/wiki/Special:Random');

co(function * () {
  for (var i = 0; i < urls.length; i++) {
    // yield pauses here until the request completes, so the URLs are
    // fetched one at a time, in order
    var res = yield request(urls[i]);
    console.log(cheerio.load(res.body)('#firstHeading').text());
  }
})();

Run with:

node --harmony-generators random.js

Or use regenerator:

regenerator -r random.js | node

Upvotes: 1

Gntem

Reputation: 7155

Using async.queue, request, and cheerio, here is a basic approach to your problem:

var async = require('async'),
    request = require('request'),
    cheerio = require('cheerio');

var concurrency = 100; // how many URLs to process in parallel

var mainQ = async.queue(function (url, callback) {
  request(url, function (err, res, body) {
    // do something with cheerio.
    // save to disk..
    console.log('%s - completed!', url);
    callback(); // end task
  });
}, concurrency);

mainQ.push(/* big array of 1000 urls */);

mainQ.drain = function () {
  console.log('Finished processing..');
};

Upvotes: 2

jfriend00

Reputation: 707158

Node's architecture and its responsiveness as a web server depend on it not doing synchronous (i.e. blocking) network operations. If you're going to develop in node.js, I'd suggest you learn how to manage asynchronous operations.

Here's a design pattern for running serialized async operations:

function processURLs(arrayOfURLs) {
    var i = 0;
    function next() {
        if (i < arrayOfURLs.length) {
            yourAsyncOperation(arrayOfURLs[i], function(result) {
                // this callback code runs when the async operation is done
                // process result here

                // advance the index
                ++i;
                // do the next one
                next();
            });
        }
    }

    next();
}

For better end-to-end performance, you may actually want to let N async operations go at once rather than truly serialize them all.
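
For example, a sketch of that variation on the pattern above (same yourAsyncOperation placeholder): instead of a single chain, you start N of them and each finished operation kicks off the next URL:

function processURLs(arrayOfURLs, concurrency) {
    var i = 0;
    function next() {
        if (i < arrayOfURLs.length) {
            var url = arrayOfURLs[i++];
            yourAsyncOperation(url, function(result) {
                // process result here, then start the next URL
                next();
            });
        }
    }
    // start up to N operations; each one launches another as it finishes
    for (var n = 0; n < concurrency; n++) {
        next();
    }
}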

You can also use promises or any of several async management libraries for node.js.
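
For instance, if yourAsyncOperation is wrapped to return a promise (call it yourAsyncOperationPromise, using native Promise or a library such as Bluebird), the same serialized loop can be sketched as a chain:

function processURLs(arrayOfURLs) {
    return arrayOfURLs.reduce(function (chain, url) {
        return chain.then(function () {
            // each URL is fetched only after the previous one has finished
            return yourAsyncOperationPromise(url).then(function (result) {
                // process result here
            });
        });
    }, Promise.resolve());
}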

Upvotes: 1
