guylabbe.ca

Reputation: 871

Memory usage with Node.js, Jsdom, HttpAgent

I have made a scraping script that navigates through a blog in order to get all the titles. The problem is that Node keeps using more and more memory as the script runs (it visits thousands of URLs), until it reaches the 8 GB maximum, and then the script crashes.

My script uses loops; surely there must be a simple way to free memory?

Here is a code example:

var request = require('request'),
    httpAgent = require('http-agent'),
    jsdom = require('jsdom').jsdom,
    myWindow = jsdom().createWindow(),
    $ = require('jquery'),
    jq = require('jquery').create(),
    jQuery = require('jquery').create(myWindow),
    profiler = require('v8-profiler');

profiler.startProfiling();

request({ uri: 'http://www.guylabbe.ca' }, function (error, response, body) {
    if (error || response.statusCode !== 200) {
        console.log('Error when contacting URL');
        return;
    }

    var last_page_lk = $(body).find('.pane-content .pager li:last-child a').attr('href');
    var nb_pages = last_page_lk.substring(last_page_lk.indexOf('=') + 1);
    var page_lk_base = last_page_lk.substring(0, last_page_lk.indexOf('='));

    var pages = [];
    pages.push(page_lk_base);
    for (var i = 1; i <= nb_pages; i++) {
        pages.push(page_lk_base + '=' + i);
    }

    // parse the listing pages
    var fiches = [];
    var agent2 = httpAgent.create('www.guylabbe.ca', pages);

    agent2.addListener('next', function (err, agent2) {

        var snapshot = profiler.takeSnapshot();

        $(body).find('.view span.field-content span.views-field-title').each(function () {
            fiches.push($(this).parents('a').attr('href'));
            //console.log($(this).html());
        });

        agent2.next();
    });
    agent2.start();

    agent2.addListener('stop', function (agent) {
        console.log('-------------------------------- (done collecting the record URLs) --------------------------------');

        // parse the individual record pages
        var agent_fiches = httpAgent.create('www.guylabbe.ca', fiches);

        agent_fiches.addListener('next', function (err, agent_fiches) {

            console.log('log info');

            agent_fiches.next();
        });
        agent_fiches.start();

        agent_fiches.addListener('stop', function (agent) {
            console.log('-------------------------------- Done! --------------------------------');
        });

        agent_fiches.addListener('start', function (agent) {
            console.log('-------------------------------- Here we go... --------------------------------');
        });

    });

});

Upvotes: 0

Views: 684

Answers (2)

drorw

Reputation: 677

I had a similar issue with jsdom leaking memory. In my case, closing the jsdom window solved it. Maybe you should add myWindow.close() after you're done scraping it. See this related answer: https://stackoverflow.com/a/6891729/1824928

Upvotes: 1

hereandnow78

Reputation: 14434

Explicitly null vars when you don't need them anymore. If you create a variable outside a closure and use it inside the closure, you should null it once you no longer need it. See this thread and read the accepted answer: How to prevent memory leaks in node.js?
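A minimal sketch of that advice (hypothetical function names): an accumulator like `fiches` in the question is captured by the agent's long-lived listeners, so it stays reachable for the whole run; handing the data off and nulling the outer reference lets V8 reclaim it.

```javascript
// Hypothetical sketch: an accumulator captured by long-lived callbacks
// (like `fiches` in the question) is kept alive by the closure. Nulling
// the outer variable once you are done releases it for garbage collection.
var results = [];

function onPage(title) {      // called for every scraped page
  results.push(title);
}

function onStop() {           // called once scraping is finished
  var out = results;
  results = null;             // drop the closure's reference
  return out;
}

onPage('Post one');
onPage('Post two');
var titles = onStop();
console.log(titles);  // [ 'Post one', 'Post two' ]
console.log(results); // null — the array itself is now collectable
```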

Upvotes: 1
