Reputation: 2757
Is there a way to get the page source for a specified URL synchronously? The problem is that I have a long list of URLs (around 1000 of them) to fetch and parse, and doing it in a loop with a callback is quite painful, because it starts all the fetchUrl calls simultaneously and parses each result in its callback.
Preferably I would like to be able to process the URLs one at a time.
Currently I use the fetch package to get the URL source and cheerio for parsing.
Upvotes: 0
Views: 364
Reputation: 14696
Sync I/O and Node don’t mix. If you really want to do this sync, you’re not gaining anything by using Node—it’s not even really possible. You could use Ruby instead.
The other answers are the correct way to do this on a production server. You should be submitting the requests to some kind of queue which can limit concurrency so you aren’t trying to make 1000 connections all at once. I like batch for this.
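If I remember the batch API correctly, the sketch looks roughly like this (urls and the request module are assumed from the question; double-check against the module's README):

var Batch = require('batch'),
    request = require('request');

var batch = new Batch();
batch.concurrency(10); // at most 10 requests in flight at a time

urls.forEach(function (url) {
  batch.push(function (done) {
    request(url, function (err, res, body) {
      done(err, body); // report the result (or error) for this job
    });
  });
});

batch.end(function (err, bodies) {
  // bodies holds the collected response bodies
});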
If this isn’t for production and you can use an unstable version of Node, you can get sync-style syntax using co, which uses generators to stop execution in the middle of a function via the yield keyword:
var co = require('co'),
    request = require('co-request'),
    cheerio = require('cheerio');

// build a list of test urls
var urls = [];
for (var i = 0; i < 10; i++)
  urls.push('http://en.wikipedia.org/wiki/Special:Random');

co(function * () {
  // each yield suspends the generator until the request finishes,
  // so the urls are fetched strictly one after another
  for (var i = 0; i < urls.length; i++) {
    var res = yield request(urls[i]);
    console.log(cheerio.load(res.body)('#firstHeading').text());
  }
})();
Run with:
node --harmony-generators random.js
Or use regenerator:
regenerator -r random.js | node
Upvotes: 1
Reputation: 7155
Here is a basic approach to your problem using async.queue, request, and cheerio:
var async = require('async'),
    request = require('request'),
    cheerio = require('cheerio');

var concurrency = 100; // how many urls to process in parallel

var mainQ = async.queue(function (url, callback) {
  request(url, function (err, res, body) {
    // do something with cheerio..
    // save to disk..
    console.log('%s - completed!', url);
    callback(); // end task
  });
}, concurrency);

mainQ.push(/* big array of 1000 urls */);

// called when the queue is empty and the last task has finished
mainQ.drain = function () {
  console.log('Finished processing..');
};
Upvotes: 2
Reputation: 707158
Node's architecture and its responsiveness as a web server depend upon not doing synchronous (i.e. blocking) network operations. If you're going to develop in node.js, I'd suggest you learn how to manage asynchronous operations.
Here's a design pattern for running serialized async operations:
function processURLs(arrayOfURLs) {
  var i = 0;
  function next() {
    if (i < arrayOfURLs.length) {
      yourAsyncOperation(arrayOfURLs[i], function (result) {
        // this callback runs when the async operation is done
        // process result here
        // advance the progress counter
        ++i;
        // do the next one
        next();
      });
    }
  }
  next();
}
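For example, yourAsyncOperation could be a fetch-plus-parse step built on the request and cheerio modules mentioned above (the selector and error handling are left as placeholders):

function yourAsyncOperation(url, callback) {
  request(url, function (err, res, body) {
    // parse the page and hand the result back to the caller
    callback(cheerio.load(body));
  });
}

processURLs(urls); // urls being your array of 1000 urls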
For better end-to-end performance, you may actually want to let N async operations go at once rather than truly serialize them all.
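A minimal sketch of that variant, keeping at most maxInFlight operations going at once (the function name and bookkeeping variables are made up for illustration):

function processURLsParallel(arrayOfURLs, maxInFlight, onDone) {
  var next = 0, active = 0;
  if (arrayOfURLs.length === 0) return onDone();
  function launch() {
    // start operations until the pool is full or the list is exhausted
    while (active < maxInFlight && next < arrayOfURLs.length) {
      active++;
      yourAsyncOperation(arrayOfURLs[next++], function (result) {
        // process result here
        active--;
        if (next >= arrayOfURLs.length && active === 0) return onDone();
        launch(); // refill the pool
      });
    }
  }
  launch();
}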
You can also use promises or any of several async management libraries for node.js.
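For instance, with any Promise implementation you can serialize the requests by reducing the url array into a chain (fetchAndParse here is a stand-in for a function that fetches one url and returns a promise):

urls.reduce(function (chain, url) {
  return chain.then(function () {
    // the next request starts only after the previous one settles
    return fetchAndParse(url);
  });
}, Promise.resolve()).then(function () {
  console.log('all done');
});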
Upvotes: 1