Reputation: 1031
I'm writing a command-line script in Node (because I know JS and suck at Bash, plus I need jQuery for navigating the DOM)… right now I'm reading an input file and iterating over each line.
How do I go about making one HTTP request (GET) per line so that I can load the resulting string with jQuery and extract the information I need from each page?
I've tried using the NPM httpsync package… so I could make one blocking GET call per line of my input file, but it doesn't support HTTPS, and of course the service I'm hitting only supports HTTPS.
Thanks!
Upvotes: 1
Views: 7440
Reputation: 1031
I was worried about making a million simultaneous requests without some kind of throttling to limit the number of concurrent connections, but it seems Node throttles me "out of the box" to around 5-6 concurrent connections.
This is perfect, as it lets me keep my code a lot simpler while still fully leveraging Node's inherent asynchrony.
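Digging in, this built-in limit appears to come from Node's global HTTP agent: its maxSockets setting defaulted to 5 in older Node versions. A minimal sketch for inspecting and raising it (the value 10 is just an example):

var http = require('http');
var https = require('https');

// In older Node versions each global agent allowed at most
// 5 concurrent sockets per host, acting as a built-in throttle.
console.log(http.globalAgent.maxSockets);

// Raise the limit explicitly if you want more concurrency:
https.globalAgent.maxSockets = 10;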
Upvotes: 0
Reputation: 698
I would most likely use the async library's eachLimit function. That lets you throttle the number of active connections and gives you a callback for when all the operations are done.
var async = require('async');
var request = require('request');

// Process at most 5 URLs at a time.
async.eachLimit(urls, 5, function(url, done) {
  request(url, function(err, res, body) {
    if (err) return done(err);
    // do something with body
    done();
  });
}, function(err) {
  if (err) return console.error(err);
  console.log('all done!');
});
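To tie this back to the question, urls can be built from the input file, one URL per line (a minimal sketch; input.txt is a placeholder for the actual file name):

var fs = require('fs');

// Read the whole input file and split it into one URL per line.
// Assumes the file fits comfortably in memory.
var urls = fs.readFileSync('input.txt', 'utf8').trim().split('\n');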
Upvotes: 1
Reputation: 144852
A good way to handle a large number of jobs in a controlled manner is the async queue.
I also recommend you look at request for making HTTP requests and cheerio for dealing with the HTML you get.
Putting these together, you get something like:
var async = require('async');
var request = require('request');
var cheerio = require('cheerio');

// Worker function: fetch the task's URL and parse the HTML.
// At most 5 tasks run concurrently.
var q = async.queue(function (task, done) {
  request(task.url, function(err, res, body) {
    if (err) return done(err);
    if (res.statusCode !== 200) return done(new Error('HTTP ' + res.statusCode));
    var $ = cheerio.load(body);
    // ... extract what you need with the jQuery-style API
    done();
  });
}, 5);
Then add all your URLs to the queue:
q.push({ url: 'https://www.example.com/some/url' });
// ...
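If you need to know when every URL has been processed, the queue also exposes a drain callback (a small addition to the sketch above; in these older versions of async it is assigned as a property):

// Fires once after the last queued task completes.
q.drain = function () {
  console.log('all done!');
};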
Upvotes: 5