Dracarys

Reputation: 291

How to improve grep efficiency in Perl when the number of files is huge

I want to grep some log information from log files in the following directory structure using Perl: $jobDir/jobXXXX/host.log, where XXXX is a job number from 1 to a few thousand. There are no other kinds of subdirectories under $jobDir, and no files other than logs under jobXXXX. The script is:

my @Info;    # store the log information
my $Num = 0;
@Info = qx(grep "information" -r $jobDir); # is this OK?

foreach (@Info) {
    if ($_ =~ /\((\d+)\)(.*)\((\d+)\)/) {
        Output(xxxxxxxx);
    }
    $Num = $Num + 1; # number count
}

It turns out that when the number of jobs reaches a few thousand, this script takes a very long time to output the information.

Is there any way to improve its efficiency?

Thanks!

Upvotes: 1

Views: 210

Answers (2)

Lee Duhem

Reputation: 15121

You should search the log files one by one and scan each log file line by line, instead of reading the whole output of grep into memory (that could use a lot of memory and slow down your program, or even your system):

# untested script

my $Num = 0;
foreach my $log (<$jobDir/job*/host.log>) {
    open my $logfh, '<', $log or die "Cannot open $log: $!";
    while (<$logfh>) {
        if (m/information/) {
            if(m/\((\d+)\)(.*)\((\d+)\)/) {
                Output(xxx);
            }
            $Num++;
        }
    }
    close $logfh;
}
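
A self-contained variant of the same idea, as a rough sketch rather than a tested drop-in: the helper name scan_logs and its return shape are made up for illustration, bsd_glob from File::Glob is used so that a $jobDir containing spaces still expands as one pattern, and index() does a plain substring test since the search term here is a fixed string.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# Hypothetical helper (not from the question): scan every
# $job_dir/job*/host.log line by line, holding only one line in memory.
# Returns the number of lines containing $needle, plus a reference to a
# list of [first_number, middle_text, last_number] for the lines that
# also match the parenthesised-number pattern.
sub scan_logs {
    my ($job_dir, $needle) = @_;
    my $count = 0;
    my @hits;
    for my $log (bsd_glob("$job_dir/job*/host.log")) {
        open my $fh, '<', $log or die "Cannot open $log: $!";
        while (my $line = <$fh>) {
            # Plain substring test; skip lines without the search term.
            next if index($line, $needle) < 0;
            $count++;
            if ($line =~ /\((\d+)\)(.*)\((\d+)\)/) {
                push @hits, [$1, $2, $3];
            }
        }
        close $fh;
    }
    return ($count, \@hits);
}
```

It would be called as my ($Num, $hits) = scan_logs($jobDir, 'information'); and Output could then be run over @$hits, or called directly inside the loop if the results should appear as they are found.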

Upvotes: 5

Steffen Ullrich

Reputation: 123270

While it would be more elegant to use the matching built into Perl (see the other answer), calling the grep command can be faster, especially if there is a lot of data but only a few matches. But the way you call it, grep runs to completion and all of its output is collected first, and only then do you scan through it. This needs more memory, and you have to wait until everything has been collected before any output appears. It would be better to process each line as soon as grep produces it:

open(my $fh, '-|', 'grep', 'information', '-r', $jobDir) or die $!;
while (<$fh>) {
    if (/\((\d+)\)(.*)\((\d+)\)/) {
        Output(xxxxxxxx);
    }
    $Num = $Num + 1; # number count
}
close $fh;
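
The same streaming approach can be wrapped in a function. This is only a sketch under some assumptions: grep_logs and the callback argument are invented names, and the extra options assume GNU grep (-F for a fixed-string search, -h to drop the file name prefix, --include to read only host.log files). Whether -F actually helps depends on the data, so it is worth timing both ways.

```perl
use strict;
use warnings;

# Hypothetical helper: stream matching lines from an external grep and
# hand the captured fields to a callback as soon as each line arrives.
# Assumes GNU grep; the list form of open bypasses the shell, so
# $job_dir needs no quoting.
sub grep_logs {
    my ($job_dir, $needle, $on_match) = @_;
    my @cmd = ('grep', '-r', '-h', '-F', '--include=host.log',
               '-e', $needle, '--', $job_dir);
    open(my $fh, '-|', @cmd) or die "Cannot run grep: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;    # every line grep prints contains $needle
        $on_match->($1, $2, $3) if $line =~ /\((\d+)\)(.*)\((\d+)\)/;
    }
    # grep exits with status 1 when nothing matched, so the close
    # status is deliberately not treated as fatal here.
    close $fh;
    return $count;
}
```

A call such as my $Num = grep_logs($jobDir, 'information', sub { Output(@_) }); would keep the counting and the output in one pass, assuming Output accepts the three captured fields.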

Upvotes: 5
