TobiV

Reputation: 108

How can I speed up this nested foreach loop in Perl?

I have a Perl script that compares two sets of data loaded into two arrays, and I'm trying to make the comparison more efficient. Currently the code is as follows:

foreach(@{FILE_DATA}) {

    if((++$file_current_linenum % 200) == 0) {
        $progress = int($file_current_linenum / $file_total_lines * 10000) / 100;
        logg("Processed $file_current_linenum file rows, $progress%, $mismatches mismatches.");
    }


    $file_current_line = $_;

    $match_found = 0;

    foreach(@{DB_DATA}) {

        $db_current_line = $_;

        if($file_current_line->{"channel"} == $db_current_line->{"channel"} ) {

            if ($file_current_line->{"checksum"} == $db_current_line->{"checksum"} &&
                $file_current_line->{"time"} > ($db_current_line->{"date_part"} - $TIME_MATCH_TOLERANCE) && 
                $file_current_line->{"time"} < ($db_current_line->{"date_part"} + $TIME_MATCH_TOLERANCE) ){

                $match_found = 1;
                last; # break;
            }
        }
    }


    if($match_found != 1) {

        push(@results, $file_current_line);
        $mismatches++;

    }
}

My first thought was to remove matches from both arrays to reduce the pool size, but would that affect the iterator's position?

Both sets of data can have up to a couple of million elements and the comparison can take a few hours to complete.

Upvotes: 1

Views: 296

Answers (2)

ikegami

Reputation: 385645

Your solution is O(DB * FILE).

The following is O(DB + FILE) if and only if there are never more than a few lines with the same channel and checksum:

my %DB_DATA;
for my $db_line (@DB_DATA) {
   push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}

for my $file_line_idx (0..$#FILE_DATA) {
   my $file_line = $FILE_DATA[$file_line_idx];
   my $found = 0;
   if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
      if (my $r2 = $r1->{ $file_line->{checksum} } ) {
         my $file_time = $file_line->{time};
         for my $db_line (@$r2) {
            my $db_time = $db_line->{date_part};
            if (abs($file_time - $db_time) < $TIME_MATCH_TOLERANCE) {
               $found = 1;
               last;
            }
         }
      }
   }

   push @mismatches, $file_line if !$found;

   if ((($file_line_idx+1) % 200) == 0) {
      logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
         $file_line_idx+1,
         int(($file_line_idx+1)/@FILE_DATA * 100),
         0+@mismatches,
      ));
   }
}

The following is O(DB + FILE) even if there are many lines with the same channel and checksum, but uses a lot of memory if $TIME_MATCH_TOLERANCE is big:

my %DB_DATA;
for my $db_line (@DB_DATA) {
   for my $db_time (
      $db_line->{date_part} - $TIME_MATCH_TOLERANCE + 1
        ..
      $db_line->{date_part} + $TIME_MATCH_TOLERANCE - 1
   ) {
      ++$DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} }{$db_time};
   }
}

for my $file_line_idx (0..$#FILE_DATA) {
   my $file_line = $FILE_DATA[$file_line_idx];
   my $found = 0;
   if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
      if (my $r2 = $r1->{ $file_line->{checksum} } ) {
         if ($r2->{ $file_line->{time} }) {
            $found = 1;
         }
      }
   }

   push @mismatches, $file_line if !$found;

   if ((($file_line_idx+1) % 200) == 0) {
      logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
         $file_line_idx+1,
         int(($file_line_idx+1)/@FILE_DATA * 100),
         0+@mismatches,
      ));
   }
}

Note: Assumes the timestamps are integers. If they're not, convert them to integers before using them as keys.
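
For example, a minimal normalization pass (a sketch, assuming the timestamps are fractional epoch seconds and that truncating to whole seconds is acceptable for your tolerance):

# Quantize the timestamps to integers so they can be used as hash keys.
$_->{date_part} = int($_->{date_part}) for @DB_DATA;
$_->{time}      = int($_->{time})      for @FILE_DATA;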


The following is O((DB + FILE) log DB) [ which is very close to O(DB + FILE) ] even if there are many lines with the same channel and checksum, and uses minimal memory:

sub binsearch(&\@) {
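   # Returns 1 if $compare returns 0 (a match) for some element of @$array,
   # and 0 otherwise; the element being probed is aliased to $_.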
   my ($compare, $array) = @_;

   my $i = 0;
   my $j = $#$array;
   return 0 if $j == -1;

   while (1) {
      my $k = int(($i+$j)/2);
      for ($array->[$k]) {
         my $cmp = $compare->()
            or return 1;

         if ($cmp < 0) {
            $j = $k-1;
            return 0 if $i > $j;
         } else {
            $i = $k+1;
            return 0 if $i > $j;
         }
      }
   }
}

my %DB_DATA;
for my $db_line (@DB_DATA) {
   push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}

for my $r1 (values(%DB_DATA)) {
   for my $r2 (values(%$r1)) {
      @$r2 = sort { $a->{date_part} <=> $b->{date_part} } @$r2;
   }
}

for my $file_line_idx (0..$#FILE_DATA) {
   my $file_line = $FILE_DATA[$file_line_idx];
   my $found = 0;
   if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
      if (my $r2 = $r1->{ $file_line->{checksum} } ) {
         my $file_time = $file_line->{time};
         my $min_db_time = $file_time - $TIME_MATCH_TOLERANCE;
         my $max_db_time = $file_time + $TIME_MATCH_TOLERANCE;
         if ( binsearch {
              $_->{date_part} >= $max_db_time ? -1
            : $_->{date_part} <= $min_db_time ? +1
            : 0
         } @$r2 ) {
            $found = 1;
         }
      }
   }

   push @mismatches, $file_line if !$found;

   if ((($file_line_idx+1) % 200) == 0) {
      logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
         $file_line_idx+1,
         int(($file_line_idx+1)/@FILE_DATA * 100),
         0+@mismatches,
      ));
   }
}

Upvotes: 3

happydave

Reputation: 7187

You could probably reduce the time significantly by pre-building a hash from DB_DATA, using the concatenation of the "channel" and "checksum" values as the key, with each value being a list of all the DB_DATA entries that share that channel and checksum. That way, for each FILE_DATA entry, you only need to check that (usually short) list, as in the sketch below.
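
A minimal sketch of that idea (the %db_index name and the ":" separator are my own choices, and it assumes channel and checksum values never contain ":"):

my %db_index;
for my $db_line (@DB_DATA) {
    # Group DB lines by "channel:checksum" so each lookup is O(1).
    push @{ $db_index{"$db_line->{channel}:$db_line->{checksum}"} }, $db_line;
}

for my $file_line (@FILE_DATA) {
    my $candidates = $db_index{"$file_line->{channel}:$file_line->{checksum}"} || [];
    my $match_found = 0;
    for my $db_line (@$candidates) {
        # Same tolerance test as the original inner loop.
        if (abs($file_line->{time} - $db_line->{date_part}) < $TIME_MATCH_TOLERANCE) {
            $match_found = 1;
            last;
        }
    }
    push @results, $file_line unless $match_found;
}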

If there are a lot of entries with a given channel and checksum, you can try to improve things even more by sorting them by date_part and then using a binary search to find a valid entry.

If there are very few entries with a given channel and checksum, this should reduce your run time by a factor of a million or so, since it reduces the run time from O($#FILE_DATA * $#DB_DATA) to O($#FILE_DATA + $#DB_DATA).

Upvotes: 0
