ahajib
ahajib

Reputation: 13510

perl - using mysql - out of memory

I am trying to do a simple thing with perl but I have hard time doing it. I have to read data from mysql do something on the data and then print something. The problem is that the size of my database file is only 90MB and my system has 4GB of memory. Could anyone help me with this issue? I mean why is that?

Here is my code:

#!/usr/bin/env perl
use common::sense;
use File::Slurp;
use Text::SpeedyFx;
use DBI;
use DBD::mysql;
my $dbh = DBI->connect('dbi:mysql:enron','x','y') or die "Connection Error:     $DBI::errstr\n";

# Create cascades
my $sql = "select uniqueid,mid,sender,receiver,unixdate,body,seen from     filteredmessage";
my $sth = $dbh->prepare($sql);
$sth->execute or die "SQL Error: $DBI::errstr\n";

# Variables
my $row; 
my $uniqueId;
my $msg1;
my $sender;
my $receiver;
my $unixdate;

my $uniqueId2;
my $msg2;
my $sender2;
my $receiver2;
my $unixdate2;
my $dif;
my $row2;
my $sql2;
my $sth2;
my $j;

my $i=1;
while ($row = $sth->fetchrow_hashref) {
    my $flagSingleRow = 0;
    $uniqueId = $row->{'uniqueid'};
    $msg1 = $row->{'body'};
    $sender = $row->{'sender'};
    $receiver = $row->{'receiver'};
    $unixdate = $row->{'unixdate'};

    $sql2 = "select uniqueid,mid,sender,receiver,unixdate,body,seen from filteredmessage";
    $sth2 = $dbh->prepare($sql2);
    $sth2->execute or die "SQL Error: $DBI::errstr\n";

    $j=1;
    while ($row2 = $sth2->fetchrow_hashref) {
        $uniqueId2 = $row2->{'uniqueid'};
        $msg2 = $row2->{'body'};
        $sender2 = $row2->{'sender'};
        $receiver2 = $row2->{'receiver'};
        $unixdate2 = $row2->{'unixdate'};

        $dif = cosine_similarity($sfx->hash_fv($msg1, 8192),$sfx->hash_fv($msg2, 8192));
        #if($dif>0.5){
            print $i." ".$j." ". $dif."\n";
        #}
        $j++;
    }
    $i++;
} 

sub cosine_similarity {
    my ($a, $b) = @_;

    my $nbits_a = unpack(q(%32b*) => $a);
    my $nbits_b = unpack(q(%32b*) => $b);

    return $nbits_a * $nbits_b
        ? unpack(q(%32b*) => $a & $b) / sqrt $nbits_a * $nbits_b
        : 0;
}

Upvotes: 0

Views: 601

Answers (4)

Schwern
Schwern

Reputation: 165426

Text::SpeedyFx leaks memory badly. Here's a simple demonstration how hashing the same thing results in memory growth.

#! /usr/bin/perl

use v5.12;
use strict;
use warnings;

use Text::SpeedyFx;

my $sfx = Text::SpeedyFx->new;

for(1..1_000) {
    $sfx->hash_fv("12345", 8192);
}

say `ps auwx $$`;

for(1..10_000) {
    $sfx->hash_fv("12345", 8192);
}

say `ps auwx $$`;

for(1..100_000) {
    $sfx->hash_fv("12345", 8192);
}

say `ps auwx $$`;

for(1..1_000_000) {
    $sfx->hash_fv("12345", 8192);
}

say `ps auwx $$`;

$ perl ~/tmp/test.plx
USER      PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
schwern 81574  29.1  0.1  2463756   8412 s000  S+    1:58AM   0:00.21 perl /Users/schwern/tmp/test.plx

USER      PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
schwern 81574  54.4  0.2  2463756  18000 s000  S+    1:58AM   0:00.22 perl /Users/schwern/tmp/test.plx

USER      PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
schwern 81574  70.2  1.4  2553868 118384 s000  S+    1:58AM   0:00.35 perl /Users/schwern/tmp/test.plx

USER      PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
schwern 81574  98.5 13.4  3561496 1122320 s000  S+    1:58AM   0:01.49 perl /Users/schwern/tmp/test.plx

That works out to about 1K per call to hash_v. Since 1K is the bit vector size, it indicates SpeedyFx is storing every call to hash_v separately rather than reusing the same slot for the same octets.

I have reported this as a bug. Meanwhile, to mitigate the problem you can initialize the Text::SpeedyFx object inside the outer loop.

Upvotes: 2

ahajib
ahajib

Reputation: 13510

It seems that the main problem with my program is using $sfx->hash_fv. I have to come up with some ideas to improve this function. This is not my own function and it comes from the SpeedyFx package. Thank you guys anyway.

Upvotes: 0

Apprentice Queue
Apprentice Queue

Reputation: 2036

Try calling $sth2->finish before you reuse $sth2 for a new query. (eg,

    }
    $sth2->finish;
    $i++;
}

)

Also my second comment is that your function cosine_similarity is symmetric (ie, cosine_similarity(x,y) = cosine_similarity(y,x)). So you could cut down almost 50% of your time if you add an ORDER BY uniqueid to $sql and make the appropriate change to $sql2 to only fetch rows where uniqueid >= $uniqueId.

Upvotes: 1

Nikhil
Nikhil

Reputation: 2308

Would it be possible to use 'fetchrow_arrayref' instead of 'fetchrow_hashref' on both the statement handles? You will then have to get your values using specific array indices rather than hash keys.

while ($row = $sth->fetchrow_arrayref) {
    my $flagSingleRow = 0;
    $uniqueId = $row->[0];
    $msg1 = $row->[1];
    $sender = $row->[2];
    $receiver = $row->[3];
    $unixdate = $row->[4];
...
# and similarly for $sth2

The actual position of each field in the array depends on the order of columns in the select statement. Give this a shot and let us know the results.

Upvotes: 0

Related Questions