Reputation: 13510
I am trying to do a simple thing with perl but I have hard time doing it. I have to read data from mysql do something on the data and then print something. The problem is that the size of my database file is only 90MB and my system has 4GB of memory. Could anyone help me with this issue? I mean why is that?
Here is my code:
#!/usr/bin/env perl
use common::sense;
use File::Slurp;
use Text::SpeedyFx;
use DBI;
use DBD::mysql;
my $dbh = DBI->connect('dbi:mysql:enron','x','y') or die "Connection Error: $DBI::errstr\n";
# Create cascades
my $sql = "select uniqueid,mid,sender,receiver,unixdate,body,seen from filteredmessage";
my $sth = $dbh->prepare($sql);
$sth->execute or die "SQL Error: $DBI::errstr\n";
# Variables
my $row;
my $uniqueId;
my $msg1;
my $sender;
my $receiver;
my $unixdate;
my $uniqueId2;
my $msg2;
my $sender2;
my $receiver2;
my $unixdate2;
my $dif;
my $row2;
my $sql2;
my $sth2;
my $j;
my $i=1;
while ($row = $sth->fetchrow_hashref) {
my $flagSingleRow = 0;
$uniqueId = $row->{'uniqueid'};
$msg1 = $row->{'body'};
$sender = $row->{'sender'};
$receiver = $row->{'receiver'};
$unixdate = $row->{'unixdate'};
$sql2 = "select uniqueid,mid,sender,receiver,unixdate,body,seen from filteredmessage";
$sth2 = $dbh->prepare($sql2);
$sth2->execute or die "SQL Error: $DBI::errstr\n";
$j=1;
while ($row2 = $sth2->fetchrow_hashref) {
$uniqueId2 = $row2->{'uniqueid'};
$msg2 = $row2->{'body'};
$sender2 = $row2->{'sender'};
$receiver2 = $row2->{'receiver'};
$unixdate2 = $row2->{'unixdate'};
$dif = cosine_similarity($sfx->hash_fv($msg1, 8192),$sfx->hash_fv($msg2, 8192));
#if($dif>0.5){
print $i." ".$j." ". $dif."\n";
#}
$j++;
}
$i++;
}
sub cosine_similarity {
my ($a, $b) = @_;
my $nbits_a = unpack(q(%32b*) => $a);
my $nbits_b = unpack(q(%32b*) => $b);
return $nbits_a * $nbits_b
? unpack(q(%32b*) => $a & $b) / sqrt $nbits_a * $nbits_b
: 0;
}
Upvotes: 0
Views: 601
Reputation: 165426
Text::SpeedyFx leaks memory badly. Here's a simple demonstration how hashing the same thing results in memory growth.
#! /usr/bin/perl
use v5.12;
use strict;
use warnings;
use Text::SpeedyFx;
my $sfx = Text::SpeedyFx->new;
for(1..1_000) {
$sfx->hash_fv("12345", 8192);
}
say `ps auwx $$`;
for(1..10_000) {
$sfx->hash_fv("12345", 8192);
}
say `ps auwx $$`;
for(1..100_000) {
$sfx->hash_fv("12345", 8192);
}
say `ps auwx $$`;
for(1..1_000_000) {
$sfx->hash_fv("12345", 8192);
}
say `ps auwx $$`;
$ perl ~/tmp/test.plx
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
schwern 81574 29.1 0.1 2463756 8412 s000 S+ 1:58AM 0:00.21 perl /Users/schwern/tmp/test.plx
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
schwern 81574 54.4 0.2 2463756 18000 s000 S+ 1:58AM 0:00.22 perl /Users/schwern/tmp/test.plx
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
schwern 81574 70.2 1.4 2553868 118384 s000 S+ 1:58AM 0:00.35 perl /Users/schwern/tmp/test.plx
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
schwern 81574 98.5 13.4 3561496 1122320 s000 S+ 1:58AM 0:01.49 perl /Users/schwern/tmp/test.plx
That works out to about 1K per call to hash_v. Since 1K is the bit vector size, it indicates SpeedyFx is storing every call to hash_v separately rather than reusing the same slot for the same octets.
I have reported this as a bug. Meanwhile, to mitigate the problem you can initialize the Text::SpeedyFx object inside the outer loop.
Upvotes: 2
Reputation: 13510
It seems that the main problem with my program is using $sfx->hash_fv. I have to come up with some ideas to improve this function. This is not my own function and it comes from the SpeedyFx package. Thank you guys anyway.
Upvotes: 0
Reputation: 2036
Try calling $sth2->finish
before you reuse $sth2
for a new query. (eg,
}
$sth2->finish;
$i++;
}
)
Also my second comment is that your function cosine_similarity
is symmetric (ie, cosine_similarity(x,y) = cosine_similarity(y,x)
). So you could cut down almost 50% of your time if you add an ORDER BY uniqueid
to $sql
and make the appropriate change to $sql2
to only fetch rows where uniqueid >= $uniqueId
.
Upvotes: 1
Reputation: 2308
Would it be possible to use 'fetchrow_arrayref' instead of 'fetchrow_hashref' on both the statement handles? You will then have to get your values using specific array indices rather than hash keys.
while ($row = $sth->fetchrow_arrayref) {
my $flagSingleRow = 0;
$uniqueId = $row->[0];
$msg1 = $row->[1];
$sender = $row->[2];
$receiver = $row->[3];
$unixdate = $row->[4];
...
# and similarly for $sth2
The actual position of each field in the array depends on the order of columns in the select statement. Give this a shot and let us know the results.
Upvotes: 0