Reputation: 108
I have a Perl script that compares two sets of data loaded into two arrays, and I'm trying to make the comparison more efficient. Currently the code is as follows:
foreach(@{FILE_DATA}) {
    if((++$file_current_linenum % 200) == 0) {
        $progress = int($file_current_linenum / $file_total_lines * 10000) / 100;
        logg("Processed $file_current_linenum file rows, $progress%, $mismatches mismatches.");
    }

    $file_current_line = $_;
    $match_found = 0;

    foreach(@{DB_DATA}) {
        $db_current_line = $_;
        if($file_current_line->{"channel"} == $db_current_line->{"channel"} ) {
            if ($file_current_line->{"checksum"} == $db_current_line->{"checksum"} &&
                $file_current_line->{"time"} > ($db_current_line->{"date_part"} - $TIME_MATCH_TOLERANCE) &&
                $file_current_line->{"time"} < ($db_current_line->{"date_part"} + $TIME_MATCH_TOLERANCE) ){

                $match_found = 1;
                last; # break;
            }
        }
    }

    if($match_found != 1) {
        push(@results, $file_current_line);
        $mismatches++;
    }
}
My first thought was to remove matched elements from both arrays to reduce the pool size; would that affect the iterator's position?
Both sets of data can have up to a couple of million elements, and the comparison can take a few hours to complete.
Upvotes: 1
Views: 296
Reputation: 385645
Your solution is O(DB * FILE).
The following is O(DB + FILE), as long as there are never more than a few lines with the same channel and checksum:
my %DB_DATA;
for my $db_line (@DB_DATA) {
    push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}

for my $file_line_idx (0..$#FILE_DATA) {
    my $file_line = $FILE_DATA[$file_line_idx];

    my $found = 0;
    if (my $r1 = $DB_DATA{ $file_line->{channel} }) {
        if (my $r2 = $r1->{ $file_line->{checksum} }) {
            my $file_time = $file_line->{time};
            for my $db_line (@$r2) {
                my $db_time = $db_line->{date_part};
                if (abs($file_time - $db_time) < $TIME_MATCH_TOLERANCE) {
                    $found = 1;
                    last;
                }
            }
        }
    }

    push @mismatches, $file_line if !$found;

    if ((($file_line_idx+1) % 200) == 0) {
        logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
            $file_line_idx+1,
            int(($file_line_idx+1)/@FILE_DATA*100),
            0+@mismatches,
        ));
    }
}
The following is O(DB + FILE) even if there are many lines with the same channel and checksum, but uses a lot of memory if $TIME_MATCH_TOLERANCE is big:
my %DB_DATA;
for my $db_line (@DB_DATA) {
    for my $db_time (
        $db_line->{date_part} - $TIME_MATCH_TOLERANCE + 1
        ..
        $db_line->{date_part} + $TIME_MATCH_TOLERANCE - 1
    ) {
        ++$DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} }{$db_time};
    }
}

for my $file_line_idx (0..$#FILE_DATA) {
    my $file_line = $FILE_DATA[$file_line_idx];

    my $found = 0;
    if (my $r1 = $DB_DATA{ $file_line->{channel} }) {
        if (my $r2 = $r1->{ $file_line->{checksum} }) {
            if ($r2->{ $file_line->{time} }) {
                $found = 1;
            }
        }
    }

    push @mismatches, $file_line if !$found;

    if ((($file_line_idx+1) % 200) == 0) {
        logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
            $file_line_idx+1,
            int(($file_line_idx+1)/@FILE_DATA*100),
            0+@mismatches,
        ));
    }
}
Note: Assumes the timestamps are integers. If they're not, convert them to integers before using them as keys.
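For example, if the timestamps have fractional seconds, one way to do that (just a sketch; the scale factor is an assumption, pick one that matches your data's precision) is to scale the timestamps and the tolerance by the same factor before building the keys:

# Sketch only: $SCALE is an assumed factor (milliseconds here). Apply the
# same conversion to the DB times, the file times, and the tolerance, or
# the buckets above won't line up.
my $SCALE = 1000;
sub time_key { int($_[0] * $SCALE) }   # truncate to an integer key

# e.g. time_key($db_line->{date_part}), time_key($file_line->{time}),
#      and time_key($TIME_MATCH_TOLERANCE)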
The following is O((DB + FILE) log DB) [ which is very close to O(DB + FILE) ] even if there are many lines with the same channel and checksum, and uses minimal memory:
# binsearch BLOCK ARRAY
# The block sees the probed element as $_ and returns negative to search
# lower, positive to search higher, or zero for a match. binsearch returns
# 1 if a match is found, 0 otherwise.
sub binsearch(&\@) {
    my ($compare, $array) = @_;
    my $i = 0;
    my $j = $#$array;
    return 0 if $j == -1;

    while (1) {
        my $k = int(($i+$j)/2);
        for ($array->[$k]) {            # Alias the probed element to $_.
            my $cmp = $compare->()
                or return 1;            # $cmp == 0: match found.
            if ($cmp < 0) {
                $j = $k-1;              # Search the lower half.
                return 0 if $i > $j;
            } else {
                $i = $k+1;              # Search the upper half.
                return 0 if $i > $j;
            }
        }
    }
}
my %DB_DATA;
for my $db_line (@DB_DATA) {
    push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}

for my $r1 (values(%DB_DATA)) {
    for my $r2 (values(%$r1)) {
        @$r2 = sort { $a->{date_part} <=> $b->{date_part} } @$r2;
    }
}

for my $file_line_idx (0..$#FILE_DATA) {
    my $file_line = $FILE_DATA[$file_line_idx];

    my $found = 0;
    if (my $r1 = $DB_DATA{ $file_line->{channel} }) {
        if (my $r2 = $r1->{ $file_line->{checksum} }) {
            my $file_time   = $file_line->{time};
            my $min_db_time = $file_time - $TIME_MATCH_TOLERANCE;
            my $max_db_time = $file_time + $TIME_MATCH_TOLERANCE;
            if ( binsearch {
                    $_->{date_part} >= $max_db_time ? -1
                  : $_->{date_part} <= $min_db_time ? +1
                  : 0
            } @$r2 ) {
                $found = 1;
            }
        }
    }

    push @mismatches, $file_line if !$found;

    if ((($file_line_idx+1) % 200) == 0) {
        logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
            $file_line_idx+1,
            int(($file_line_idx+1)/@FILE_DATA*100),
            0+@mismatches,
        ));
    }
}
Upvotes: 3
Reputation: 7187
You could probably reduce the time significantly by pre-building a hash from DB_DATA, using the concatenation of the "channel" and "checksum" values as the key, and each value being a list of all of the DB_DATA entries with that channel and checksum. That way, for each FILE_DATA entry, you only need to check that list.
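A minimal sketch of that lookup (%db_index and the "\0" separator are illustrative choices, and it assumes the channel and checksum values never contain the separator):

my %db_index;
for my $db_line (@DB_DATA) {
    my $key = join "\0", $db_line->{channel}, $db_line->{checksum};
    push @{ $db_index{$key} }, $db_line;
}

for my $file_line (@FILE_DATA) {
    my $key        = join "\0", $file_line->{channel}, $file_line->{checksum};
    my $candidates = $db_index{$key} || [];   # only DB lines with this channel+checksum
    my $match_found = grep {
        abs($file_line->{time} - $_->{date_part}) < $TIME_MATCH_TOLERANCE
    } @$candidates;
    push @results, $file_line unless $match_found;
}

(grep scans the whole candidate list; List::Util's any would stop at the first hit.)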
If there are a lot of entries with a given channel and checksum, you can try to improve things even more by sorting them by date_part and then using a binary search to find a valid entry.
If there are very few entries with a given channel and checksum, this should reduce your run time by a factor of a million or so, since it reduces the run time from O($#FILE_DATA * $#DB_DATA) to O($#FILE_DATA + $#DB_DATA).
Upvotes: 0