ezdin gharbi

delaying requests using request and cheerio modules

This is the code I used to crawl my pages (I'm using the request and cheerio modules):

for (let j = 1; j < nbRequest; j++) {
  const currentPromise = new Promise((resolve, reject) => {
    request(`https://www.url${j}`, (error, response, body) => {
      if (error || !response) {
        console.log("Error: " + error);
        return reject(error);
      }

      console.log("Status code: " + response.statusCode + ", Connected to the page");

      const $ = cheerio.load(body);
      let output = {
        ranks: [],
        names: [],
        numbers: [],
      };

      $('td.rangCell').each(function (index) {
        if ($(this).text().trim() != "Rang") {
          output.ranks.push($(this).text().trim().slice(0, -1));
          nbRanks = nbRanks + 1;
        }
      });

      $('td.nameCell:has(label)').each(function (index) {
        output.names.push($(this).find('label.nameValue > a').text().trim());
      });

      $('td.numberCell').each(function (index) {
        if ($(this).text().trim() != "Nombre") {
          output.numbers.push($(this).text().trim());
        }
      });

      console.log("HERE 1");
      return resolve(output);
    });
  });

  promises.push(currentPromise);
}

After that I parse the results and save them to a CSV file using a Node module. So far I've been able to crawl about 100 pages, but with much bigger numbers (1000+) I get a 500 response, which I think means I'm being blocked. I believe the best solution is to delay the requests, but I haven't found how to do it. Do you have any idea what the code would look like?

Upvotes: 1


Answers (1)

What you are looking for is called "control flow"; you can achieve it with async.queue, for example.

If you add every request to the queue, you can control the number of parallel requests with the number of workers, and you can add a setTimeout to the final part of the request's callback to delay the requests.

Additionally, I'd suggest using a "crawler" package (instead of building your own), e.g. npm-crawler, as they ship with built-in rate limiting and have already taken care of other things you might face next :) e.g. a user-agent pool. A rough sketch of that approach is shown below.
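
For reference, the rate-limited crawler approach could look roughly like this (a sketch using the crawler package; the option names come from its README, so double-check them against the version you install, and the URL is just the placeholder from the question):

const Crawler = require("crawler");

const c = new Crawler({
  maxConnections: 1, // one request at a time
  rateLimit: 1500,   // minimum delay (ms) between two requests
  callback: function (error, res, done) {
    if (error) {
      console.log("Error: " + error);
    } else {
      const $ = res.$; // the crawler injects a cheerio instance for you
      console.log($("title").text());
    }
    done();
  }
});

c.queue("https://www.url1"); // placeholder URL from the question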

Update:

const async = require("async");
const delayTime = 1500; // wait 1.5 seconds after every request

function getRequestPromise(task) {
  return new Promise((resolve, reject) => {
    // make your request here and call resolve(result) / reject(error)
  });
}

const asyncQueue = async.queue(function (task, callback) {
  getRequestPromise(task)
    .then(_ => {
      setTimeout(() => {
        callback(null);
      }, delayTime);
    })
    .catch(err => callback(err)); // don't let a failed request stall the queue
}, 1); // 1 = one request at a time

// push every CSV line onto the queue (pseudo: csv is your array of lines)
for (const csvLine of csv) {
  asyncQueue.push(csvLine, () => {});
}

asyncQueue.drain = () => { // with async v3, use asyncQueue.drain(() => { ... })
  console.log("finished.");
};
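
And if you want to keep the request/cheerio parsing from your question, getRequestPromise could wrap it roughly like this (a sketch based on the code above, assuming the task you push onto the queue is the page number, or something you can derive it from):

const request = require("request");
const cheerio = require("cheerio");

function getRequestPromise(pageNumber) {
  return new Promise((resolve, reject) => {
    request(`https://www.url${pageNumber}`, (error, response, body) => {
      if (error || !response) {
        return reject(error || new Error("No response"));
      }

      const $ = cheerio.load(body);
      const output = { ranks: [], names: [], numbers: [] };

      // same selectors as in the question
      $('td.rangCell').each(function () {
        if ($(this).text().trim() != "Rang") {
          output.ranks.push($(this).text().trim().slice(0, -1));
        }
      });
      $('td.nameCell:has(label)').each(function () {
        output.names.push($(this).find('label.nameValue > a').text().trim());
      });
      $('td.numberCell').each(function () {
        if ($(this).text().trim() != "Nombre") {
          output.numbers.push($(this).text().trim());
        }
      });

      resolve(output);
    });
  });
}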

Upvotes: 1
