Vibes

Reputation: 33

How to increase the performance of a Perl script

I have two files, newFile and LookupFile, both of which are huge. The contents of newFile are searched in LookupFile and further processing happens. The script works fine, but it takes a long time to execute. Could you please let me know what can be done to increase its performance? Also, can the files be converted into hashes to improve performance?

My files look like this:

NewFile and LookupFile:

acl sourceipaddress subnet destinationipaddress subnet portnumber . .

Script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Slurp::Tiny 'read_file';
    use File::Copy qw(copy);
    use Data::Dumper;
    my %options = (
        LookupFile => {
            type     => "=s",
            help     => "File name",
            variable => 'gitFile',
            required => 1,
        },
        newFile => {
            type     => "=s",
            help     => "file containing the acl lines to be checked",
            variable => 'newFile',
            required => 1,
        },
    );

    $opts->addOptions(%options);
    $opts->parse();
    $opts->validate();
    my $newFile    = $opts->getOption('newFile');
    my $LookupFile = $opts->getOption('LookupFile');

    my @LookupFile = read_file ("$LookupFile");
    my @newFile = read_file ("$newFile"); 
    @LookupFile = split (/\n/,$LookupFile[0]);
    my @newLines = split (/\n/,$newFile[0]);
    open FILE1, "$newFile" or die "Could not open file: $! \n";

while(my $line = <FILE1>)
    {
        chomp($line);
        my @columns = split(' ',$line);
        my $var = @columns;
        my $fld1;
        my $cnt;
        my $fld2;
        my $fld3;
        my $fld4;
        my $fld5;
        my $dIP;
        my $sIP;
        my $sHOST;
        my $dHOST;
        if (....) {
            if (....) {
                # further checks and processing
            }
        }
    }

Upvotes: 0

Views: 182

Answers (1)

Schwern

Reputation: 164629

The first thing to do before any optimization is to profile your code. Rather than guessing, profiling will tell you which lines take up the most time and how often they're called. Devel::NYTProf is a good tool for the job.
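
For example (yourscript.pl below is just a placeholder for your own script name and arguments), a typical profiling run looks like this:

# Run the script under the profiler; this writes a nytprof.out file.
perl -d:NYTProf yourscript.pl

# Turn nytprof.out into an HTML report you can browse.
nytprofhtml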


This is a problem.

my @LookupFile = read_file ("$LookupFile");
my @newFile = read_file ("$newFile"); 
@LookupFile = split (/\n/,$LookupFile[0]);
@newLines = split (/\n/,$newFile[0]);

read_file reads the whole file in as one big string (it should be my $contents = read_file(...); using an array here is awkward). Then the code splits that string on newlines, copying everything in the file a second time. This is slow, hard on memory, and unnecessary.

Instead, use read_lines. This splits the file into lines as it reads, avoiding a costly copy.

my @lookups = read_lines($LookupFile);
my @new     = read_lines($newFile);

The next problem is that $newFile is opened again and iterated through line by line.

open FILE1, "$newFile" or die "Could not open file: $! \n";
while(my $line = <FILE1>) {

This is a waste as you've already read that file into memory. Use one or the other. However, in general, it's better to work with files line-by-line than to slurp them all into memory.
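
For example, a minimal line-by-line pass over $newFile (using a lexical filehandle and a three-argument open) could look like this; the body of the loop is whatever your existing checks and processing do:

open my $new_fh, '<', $newFile or die "Could not open $newFile: $!";
while (my $line = <$new_fh>) {
    chomp $line;
    my @columns = split ' ', $line;
    # ... your existing checks and processing on @columns ...
}
close $new_fh;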


The above will speed things up, but those changes don't get at the crux of the problem. This is likely the real problem...

The contents in newFile will be searched in LookupFile and further processing happens.

You didn't show what you're doing, but I'm going to imagine it looks something like this...

for my $line (@lines) {
    for my $thing (@lookups) {
        ...
    }
}

That is, for each line in one file, you're looking at every line in the other. This is what is known as an O(n^2) algorithm, meaning that as you double the size of the files, you quadruple the time.

If each file has 10 lines, it will take 100 (10^2) turns through the inner loop. If they have 100 lines, it will take 10,000 (100^2). With 1,000 lines it will take 1,000,000 (1000^2).

With O(n^2) as the sizes get bigger things get very slow very quickly.

Could you please let me know if we can convert files into hash to increase performance?

You've got the right idea. You could convert the lookup file to a hash to speed things up. Let's say they're both lists of words.

# input
foo
bar
biff
up
down

# lookup
foo
bar
baz

And you want to check if any lines in input match any lines in lookup.

First you'd read lookup in and turn it into a hash. Then you'd read input and check if each line is in the hash.

use strict;
use warnings;
use autodie;
use v5.10;

...

# Populate `%lookup`
my %lookup;
{
    open my $fh, $lookupFile;
    while(my $line = <$fh>) {
        chomp $line;
        $lookup{$line} = 1;
    }
}

# Check if any lines are in %lookup
open my $fh, $inputFile;
while(my $line = <$fh>) {
    chomp $line;
    print $line if $lookup{$line};
}

This way you only iterate through each file once. This is an O(n) algorithm, meaning it scales linearly, because hash lookups are effectively constant time. If each file has 10 lines, it will only take 10 iterations of each loop. If they have 100 lines, it will only take 100 iterations of each loop. 1,000 lines, 1,000 iterations.
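
Your real lines have several fields, so you'll have to decide what counts as a match. As a rough sketch, and only guessing at the match criterion, suppose a newFile line counts as a match when its source and destination addresses (fields 1 and 3 in your sample line) appear together in LookupFile; then you could key the hash on just those fields:

# Build the lookup keyed on "sourceip destinationip" (guessed field positions).
my %lookup;
{
    open my $fh, '<', $LookupFile or die "Could not open $LookupFile: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my @f = split ' ', $line;
        $lookup{"$f[1] $f[3]"} = 1;
    }
}

# Check each newFile line against the hash.
open my $new_fh, '<', $newFile or die "Could not open $newFile: $!";
while (my $line = <$new_fh>) {
    chomp $line;
    my @f = split ' ', $line;
    print "$line\n" if $lookup{"$f[1] $f[3]"};
}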


Finally, what you really want to do is skip all this and create a database for your data and search that. SQLite is a SQL database that requires no server, just a file. Put your data in there and perform SQL queries on it using DBD::SQLite.

While this means you have to learn SQL, and there is a cost to building and maintaining the database, this approach is fast and, most importantly, very flexible. SQLite can do all sorts of searches quickly without you having to write a bunch of extra code. SQL databases are very common, so learning SQL is a very good investment.

Since you're splitting the file up with my @columns = split(' ',$line); it's probably a file with many fields in it. That will likely map to a SQL table very well.
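
A minimal sketch with DBI and DBD::SQLite might look like the following; the database file name, table name, and column names are all made up here, and the final query is just one example of the kind of check you could run:

use strict;
use warnings;
use DBI;

# Open (or create) a SQLite database stored in a single file.
my $dbh = DBI->connect("dbi:SQLite:dbname=acl.db", "", "", { RaiseError => 1 });

# A table shaped roughly like the ACL lines (column names are guesses).
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS lookup (
        src_ip  TEXT,
        src_net TEXT,
        dst_ip  TEXT,
        dst_net TEXT,
        port    TEXT
    )
});

# Load the lookup data into the table, one row per line.
my $LookupFile = "LookupFile";   # path to your lookup data
my $insert = $dbh->prepare(
    "INSERT INTO lookup (src_ip, src_net, dst_ip, dst_net, port) VALUES (?, ?, ?, ?, ?)"
);
open my $fh, '<', $LookupFile or die "Could not open $LookupFile: $!";
while (my $line = <$fh>) {
    chomp $line;
    my (undef, @fields) = split ' ', $line;   # drop the leading "acl" token
    $insert->execute(@fields[0 .. 4]);
}

# Then ask SQL whether a given source/destination pair exists.
my $check = $dbh->prepare("SELECT 1 FROM lookup WHERE src_ip = ? AND dst_ip = ?");
$check->execute('10.1.2.3', '10.4.5.6');
print "found\n" if $check->fetchrow_array;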

SQLite can even import files like that for you. See this answer for details on how to do that.

Upvotes: 2
