Ragnar

Reputation: 2690

Node.js: good way to chain requests to scrape a slow website

I can't get access to the database, so I have to scrape a website to get its data back. The site and server are poorly built, so some pages take more than 10 s to render.

I use Node and request-promise to fetch the HTML, and cheerio to build a JS object that I want to convert to a JSON file. I have to loop over every day of several years in the URL params (I'm doing it for January 2016 first as a test).

My problem is that Node is asynchronous, so the loops do not wait for a response: all the requests are sent at almost the same time (within around 100 ms, so practically instantly). The website cannot handle this, so I get the HTML for the first request and then 500 errors.

What I plan to do is wait until one iteration is fully scraped before sending the next request (to let the poor server breathe a bit).

Like so:

Enter the loop => request => get HTML back (10 s) => scrape it => write to disk => i++; enter the loop => ...

rather than firing all the requests at once.

Here is part of my code:

var fs = require('fs')
var rp = require('request-promise')
var cheerio = require('cheerio')

[...]

console.log('Start 💀');

let array = []

// The loops do not wait for each request, so all requests start almost at once.
for (var year = 2016; year < 2017; year++) {
  for (var month = 1; month <= 1; month++) {
    for (var day = 1; day <= 31; day++) {
      const options = {
        url: 'http://myurl',
        headers: { Cookie: cookie },
        // Hand the parsed document to .then() instead of the raw body
        transform: function (body) {
          return cheerio.load(body);
        }
      }

      rp(options)
        .then(function ($) {
          let data
          // => My scraping stuff, result stored in data
          fs.writeFile(`./data/${timestamp}.json`, JSON.stringify(data), function (err) {
            if (err) {
              return console.log(err);
            }
            console.log(`😈 File successfully written! - ${timestamp}`)
          })
        })
        .catch(function (err) {
          // Crawling failed or Cheerio choked...
        })
    }
  }
}

If I limit the loop to only 2 or 3 days, everything works fine.

Upvotes: 0

Views: 603

Answers (1)

Alex

Reputation: 4276

I use the Crawler package. It works very well for me :)
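If you would rather not add a dependency, the "one request at a time" flow described in the question can also be sketched with a plain loop and async/await. This is only a rough sketch, assuming a Node version that supports async/await. The names `listDays`, `sleep`, `scrapeSequentially`, `fetchPage`, `handlePage` and `pauseMs` are made up for illustration and are not part of request-promise or cheerio. `listDays` also builds only real calendar dates, so short months do not produce a day 31.

```javascript
// Build 'YYYY-MM-DD' strings for every real calendar day from startYear to endYear.
// UTC is used so daylight-saving changes cannot skip or repeat a day.
function listDays(startYear, endYear) {
  const days = [];
  const current = new Date(Date.UTC(startYear, 0, 1));
  while (current.getUTCFullYear() <= endYear) {
    days.push(current.toISOString().slice(0, 10));
    current.setUTCDate(current.getUTCDate() + 1);
  }
  return days;
}

// Resolve after ms milliseconds; used to give the server a pause between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Process one day at a time: the next request starts only after the previous
// page has been fetched and handled (scraped and written to disk, for example).
// fetchPage(day) and handlePage(day, page) are caller-supplied functions that
// return promises (hypothetical names, not a library API).
async function scrapeSequentially(days, fetchPage, handlePage, pauseMs) {
  for (const day of days) {
    try {
      const page = await fetchPage(day);
      await handlePage(day, page);
    } catch (err) {
      // Log and move on so one bad day does not stop the whole run.
      console.log('Failed for ' + day + ': ' + err.message);
    }
    await sleep(pauseMs);
  }
}
```

For the question's case, `fetchPage` could return `rp(options)` with the day placed in the URL, and `handlePage` could do the cheerio scraping and return a promise that resolves once the JSON file has been written. A `pauseMs` of a second or so is a guess; tune it to whatever the server tolerates.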

Upvotes: 1
