cookersjs

Reputation: 121

Memory leak when performing repeated queries using WWW::Mechanize

I am trying to find a memory leak in my program. I have found where the leak originates but I cannot fix it.

The program reads each gene page linked from each chromosome's category, as found on Wikipedia's Genes by human chromosome page.

The program extracts the information I am interested in on each gene page, moves onto the next gene page and so on.

Once it reaches the end of the gene list of the current chromosome, it moves onto the next chromosome until it has gone through each page.

The program worked on my computer until about 2-3 weeks ago. Since then it has had this problem.

I have been monitoring it with top, and memory usage increases steadily as the program runs until it reaches a critical point and my computer crashes.
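To see the growth from inside the script rather than from top, one option is to log the process's resident set size at intervals. The following is a minimal sketch, assuming a Linux system where /proc/self/status is readable; the rss_kb helper and the loop that deliberately retains memory are hypothetical stand-ins, not part of the original program.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return this process's resident set size in kB, read from /proc (Linux only).
# Returns undef if the file or the VmRSS field is unavailable.
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while ( my $line = <$fh> ) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

# Hypothetical stand-in for the per-page work: retain memory on every pass
# so the reported figure has something to show.
my @kept;
for my $i ( 1 .. 50 ) {
    push @kept, 'x' x 100_000;
    printf "iteration %d: RSS %s kB\n", $i, rss_kb() // 'n/a' unless $i % 10;
}
```

Calling rss_kb() next to the existing "gene pages complete" message would show whether memory grows with every request or only at certain points.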

As requested, I am providing code that can be run as-is. I have started it at chromosome 21, since that one has the fewest pages and will therefore take the least time to get through. Memory usage still increases incrementally in this snippet, so hopefully this is enough! Also, the eval blocks are there because querying the Wikipedia API sometimes returns nothing instead of the expected JSON. Wrapping the calls in eval let me get around this without letting the program die.

My (Updated) code

#!/usr/bin/env perl -w

use common::sense;
use WWW::Mechanize;
use URI;
use HTTP::Request;
use Cpanel::JSON::XS qw(decode_json);

my ( $self, $registry ) = @_;

my $mech = WWW::Mechanize->new();

my $root = URI->new("http://en.wikipedia.org/w/api.php");

my $url = $root->clone();

for my $i ( 21 .. 25 ) {
    my $chrom = $i;
    if ( $chrom == 23 ) {
        $chrom = "M";
    }
    elsif ( $chrom == 24 ) {
        $chrom = "Y";
    }
    elsif ( $chrom == 25 ) {
        $chrom = "X";
    }
    print "Hi!\n The chromosome is $chrom\n";

    my $query = {
        action     => 'query',
        format     => 'json',
        list       => 'categorymembers',
        cmtitle    => "Category:Genes on human chromosome $chrom",
        cmlimit    => 'max',
        cmcontinue => ''
    };

    $url->query_form($query);

    my @gene_pages = ();
    eval {
        while ( my $response = $mech->get($url) ) {
            my $perl_scalar = decode_json( $response->decoded_content() )
                ;    #J Source of malformed JSON string error
            push @gene_pages, @{ $perl_scalar->{query}->{categorymembers} };
            my $count = @gene_pages;

            # Adapted code to new format for continuing queries

            if ( $perl_scalar->{continue} ) {
                $query->{cmcontinue} = $perl_scalar->{continue}->{cmcontinue};
                $url->query_form($query);
            }
            else {
                last;
            }
        }
    };
    if ( $@ =~ /malformed/ ) {
        redo;
    }
    my $gene_count = 0;
    eval {
        foreach my $gene_page (@gene_pages) {
            $gene_count++;
            my $url   = $root->clone();
            my $query = {
                action  => 'query',
                prop    => 'revisions',
                format  => 'json',
                rvprop  => 'content|tags|timestamp',
                pageids => $gene_page->{pageid}
            };
            $url->query_form($query);

            #       $log->debugf("Requesting: %s", $url->as_string());
            my $response    = $mech->get($url);
            my $content     = $response->decoded_content();
            my $perl_scalar = decode_json( $response->decoded_content() )
                ;    #J Source of malformed JSON string error
            if ( $gene_count % 10 == 0 ) {
                print "$gene_count gene pages complete\n";
            }
        }
    };
    print "There were $gene_count genes found for chromosome $chrom\n";

}

The full program is much larger, but I have excluded the rest because I know the issue originates in this part.

The WWW::Mechanize call in the while loop

my $response = $mech->get($url)

is connected to the memory leak.

If I remove that call and run the program, memory use stays roughly constant. Adding it back makes memory rise incrementally once again.

Perl version: 5.24.1

System: Ubuntu 16.04

Edit: @Borodin Thank you for such a thorough reply! Unfortunately, I am still noticing a memory leak on my computer, which makes me wonder whether there is a larger problem beyond this.

It still takes up memory incrementally. For now my computer is OK with it, but when I run the full program, which includes some web scraping, I am not sure my computer will be able to cope.

On a potentially related note -- My computer has a weird issue where it sometimes is unable to download files fully (files are truncated despite the download being complete). When I was running your program I got this error a lot:

unexpected end of string while parsing JSON string, at character offset 5506 (before "(end of string)")

It seems like it could be related to that issue I am having and I wonder if this contributes to the memory leak problem?
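A truncated response body would produce exactly this kind of decode error. One way to handle it without an unbounded redo is to wrap only the decode step in eval and retry a fixed number of times. The sketch below is an illustration under stated assumptions: fetch_json and the canned response bodies are hypothetical, and the core module JSON::PP stands in for Cpanel::JSON::XS so the example runs without a network connection.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);    # core module; assumed stand-in for Cpanel::JSON::XS

# Call $fetch (a code ref returning a response body) up to $max_tries times.
# Return the decoded data, or undef if every attempt fails to parse.
sub fetch_json {
    my ( $fetch, $max_tries ) = @_;
    for my $attempt ( 1 .. $max_tries ) {
        my $body = $fetch->();
        my $data = eval { decode_json($body) };
        return $data if defined $data;
        warn "attempt $attempt: could not decode JSON: $@";
    }
    return;
}

# Hypothetical fetcher: truncated on the first call, complete on the second.
my @bodies = ( '{"query":{"pages":', '{"query":{"pages":{}}}' );
my $data = fetch_json( sub { shift @bodies }, 3 );
print defined $data ? "decoded after retry\n" : "gave up\n";
```

In the real program the code ref would perform the HTTP request and return the decoded content. Capping the number of attempts means a permanently broken page is skipped instead of being retried forever.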

Upvotes: 1

Views: 218

Answers (1)

Borodin

Reputation: 126742

You don't use any part of WWW::Mechanize that LWP::UserAgent doesn't provide, so I recommend that you use the latter instead.

Here is some working code that does pretty much the same as your own program. It doesn't exhibit any memory leakage for me.

Please ask if you need anything explained; there is too much content to go through the entire program.

#!/usr/bin/env perl

use strict;
use warnings 'all';

use URI;
use URI::QueryParam;
use LWP;
use JSON::XS qw(decode_json);

STDOUT->autoflush;

my $api_root = URI->new( 'http://en.wikipedia.org/w/api.php' );

my @chromosomes = ( 1 .. 22, qw/ M Y X/ );

my $ua = LWP::UserAgent->new;

for my $chrom ( @chromosomes[20..$#chromosomes] ) {

    #print "The chromosome is $chrom\n";

    my $query = {
        action  => 'query',
        format  => 'json',
        list    => 'categorymembers',
        cmtitle => "Category:Genes on human chromosome $chrom",
        cmlimit => 'max',
    };

    my $url = $api_root->clone;
    $url->query_form( $query );

    my @gene_pages;

    while () {

        my $resp = $ua->get( $url );
        die $resp->status_line unless $resp->is_success;

        # J Source of malformed JSON string error
        my $data     = decode_json( $resp->decoded_content );
        my $query    = $data->{query};
        my $continue = $data->{continue};

        push @gene_pages, @{ $query->{categorymembers} };

        # Adapted code to new format for continuing queries
        last unless $continue;

        $url->query_param( cmcontinue => $continue->{cmcontinue} );
    }

    printf "Processing %d gene pages for chromosome %s\n",
            scalar @gene_pages,
            $chrom;

    my $gene_count = 0;    # start at 0 so the summary prints cleanly for an empty category

    for my $gene_page ( @gene_pages ) {

        ++$gene_count;

        my $url = $api_root->clone;

        my $query = {
            action  => 'query',
            prop    => 'revisions',
            format  => 'json',
            rvprop  => 'content|tags|timestamp',
            pageids => $gene_page->{pageid}
        };

        $url->query_form( $query );

        # print "Requesting: $url\n";

        my $resp = $ua->get( $url );

        die $resp->status_line unless $resp->is_success;

        my $content = $resp->decoded_content;
        my $data    = decode_json( $content );    # J Source of malformed JSON string error

        print "$gene_count gene pages complete\n" unless $gene_count % 10;
    }

    print "There were $gene_count genes found for chromosome $chrom\n";
}
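If you would rather keep WWW::Mechanize, it may be worth looking at its page history. The module's documentation describes a stack_depth option: by default the object keeps a history of the pages it has fetched, and the documentation suggests lowering the depth if the stack is eating memory. Whether that history is the cause of the growth seen here is an assumption, not something confirmed in this thread. A minimal sketch:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# stack_depth => 0 tells Mechanize to keep no page history, so earlier
# responses are not retained after each get().
my $mech = WWW::Mechanize->new( stack_depth => 0 );

# The accessor reports the configured depth.
print "history depth: ", $mech->stack_depth, "\n";
```

The rest of the original program could stay unchanged; only the constructor call differs.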

Upvotes: 1
