Akan

Reputation: 272

Parsing large files in Perl

I need to compare a large file (2 GB, 22 million lines) with another file. Processing took too long with Tie::File, so I switched to a plain 'while' loop, but the problem remains. My code is below.

use strict;
use Tie::File;
# use warnings;
my @arr;
# tie @arr, 'Tie::File', 'title_Nov19.txt';

# open(IT,"<title_Nov19.txt");
# my @arr=<IT>;
# close(IT);
open(RE,">>res.txt");

open(IN,"<input.txt");

while(my $data=<IN>){
    chomp($data);
    print"$data\n";
    my $occ=0;

    open(IT,"<title_Nov19.txt");    
    while(my $line2=<IT>){

        my $line=$line2;
        chomp($line);

        if($line=~m/\b$data\b/is){

            $occ++;

        }

    }
print RE"$data\t$occ\n";
}


close(IT);
close(IN);
close(RE);

Please help me reduce the processing time.

Upvotes: 0

Views: 6044

Answers (4)

Ademir F Furtado

Reputation: 1

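Try this (note that with -c, grep prints a single count of the lines matching any of the words, not a separate count per word):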

grep -i -c -w -f input.txt title_Nov19.txt > res.txt

Upvotes: 0

LeoNerd

Reputation: 8542

Lots of things wrong with this.

Aside from the usual (use warnings commented out, use of 2-argument open(), not checking the open() result, use of global filehandles), the specific problem in your case is that you are opening and reading the second file once for every single line of the first. With 22 million lines in that file, this is going to be very slow.
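For illustration, the "usual" points might look like this in practice. This is only a minimal sketch, not the asker's program: the file name example.txt and its contents are made up so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical demo file, created here so the sketch is self-contained.
my $path = 'example.txt';

# Three-argument open with a lexical filehandle; die with $! on failure.
open my $out, '>', $path or die "Cannot write $path: $!\n";
print {$out} "first line\nsecond line\n";
close $out or die "Cannot close $path: $!\n";

# Read the file back line by line instead of slurping it into memory.
open my $in, '<', $path or die "Cannot read $path: $!\n";
my $lines = 0;
while ( my $line = <$in> ) {
    chomp $line;
    $lines++;
}
close $in;

print "$lines lines read\n";
unlink $path;
```

The lexical handles ($out, $in) are closed automatically when they go out of scope, and the die messages say which file failed and why.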

I suggest you open title_Nov19.txt once, read all its lines into an array or hash, and close it. Then open the first file, input.txt, and walk through it once, comparing each line against the data in memory so you never have to reopen the second file.
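One way the hash idea could look is sketched below. It rests on assumptions that are not stated in the question: every entry in input.txt is a single word, and matching is case-insensitive. With a 2 GB file it may also be more practical to keep the smaller file in memory and stream the larger one, which is what this sketch does. The in-memory lists stand in for the two files.

```perl
use strict;
use warnings;

# Stand-ins for input.txt (small) and title_Nov19.txt (large). In practice
# the titles would be read line by line from a filehandle.
my @words  = ( 'foo', 'bar' );
my @titles = (
    'This is the foo title',
    'Wow, we have bar too',
    'Nothing special here but foo',
    'OMG, the last title! And Foo again!',
);

# Build the lookup table once; keys are lowercased for case-insensitive lookup.
my %count = map { ( lc($_) => 0 ) } @words;

for my $title (@titles) {
    my %seen;    # count each word at most once per line
    for my $token ( map { lc $_ } $title =~ /\w+/g ) {
        next unless exists $count{$token};
        $count{$token}++ unless $seen{$token}++;
    }
}

# Prints "foo<TAB>3" and "bar<TAB>1" for the sample data above.
print "$_\t$count{$_}\n" for map { lc $_ } @words;
```

Each title line costs one pass over its words plus a hash lookup per word, regardless of how many entries input.txt has. Entries that contain spaces or punctuation would not be found this way and would need a regex-based approach instead.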

Further, I suggest you read some basic articles on Perl style, as your question is likely to gain more attention if the code is written to reasonably modern standards.

Upvotes: 2

Kenosis

Reputation: 6204

Here's another option using memowe's (thank you) data:

use strict;
use warnings;
use File::Slurp qw/read_file write_file/;

my %count;
my $regex = join '|', map { chomp; $_ = "\Q$_\E" } read_file 'input.txt';

for ( read_file 'title_Nov19.txt' ) {
    my %seen;
    !$seen{ lc $1 }++ and $count{ lc $1 }++ while /\b($regex)\b/ig;
}

write_file 'res.txt', map "$_\t$count{$_}\n",
  sort { $count{$b} <=> $count{$a} } keys %count;

Numerically-sorted output to res.txt:

foo 3
bar 1

The script builds a single alternation regex from the input words, quoting any metacharacters with \Q$_\E, so only one pass over the large file's lines is needed. The hash %seen ensures that each input word is counted at most once per line.
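For readers without File::Slurp, roughly the same idea can be sketched with core Perl only. This is an illustrative variant, not the script above: the in-memory lists stand in for the two files, and the pattern is compiled once with qr//.

```perl
use strict;
use warnings;

# Stand-ins for input.txt and title_Nov19.txt (memowe's test data).
my @words  = ( 'foo', 'bar' );
my @titles = (
    'This is the foo title',
    'Wow, we have bar too',
    'Nothing special here but foo',
    'OMG, the last title! And Foo again!',
);

# Quote metacharacters, then compile the alternation once, case-insensitively.
my $alternation = join '|', map { quotemeta $_ } @words;
my $regex       = qr/\b($alternation)\b/i;

my %count;
for my $title (@titles) {
    my %seen;    # count each word at most once per line
    while ( $title =~ /$regex/g ) {
        my $word = lc $1;
        $count{$word}++ unless $seen{$word}++;
    }
}

# Highest count first; prints "foo<TAB>3" then "bar<TAB>1" for this data.
for my $word ( sort { $count{$b} <=> $count{$a} } keys %count ) {
    print "$word\t$count{$word}\n";
}
```

Words that never match do not appear in %count at all, so they are missing from the output rather than listed with a count of 0.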

Hope this helps!

Upvotes: 0

memowe

Reputation: 2668

I tried to build a small example script with a better structure, but I have to say, man, your problem description is really unclear. As @LeoNerd explained in his answer, it's important not to reread the whole comparison file each time. I read the search strings from input.txt into memory once, walk through title_Nov19.txt a single time, and use a hash to keep track of the match count:

#!/usr/bin/env perl

use strict;
use warnings;

# cache all lines of the comparison file
open my $comp_file, '<', 'input.txt' or die "input.txt: $!\n";
chomp (my @comparison = <$comp_file>);
close $comp_file;

# prepare comparison
open my $input,  '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!\n";
my %count = ();

# compare each line
while (my $title = <$input>) {
    chomp $title;

    # iterate comparison strings
    foreach my $comp (@comparison) {
        $count{$comp}++ if $title =~ /\b$comp\b/i;
    }
}

# done
close $input;

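# output (in the order of input.txt; strings that never matched get 0)
open my $output, '>>', 'res.txt' or die "res.txt: $!\n";
foreach my $comp (@comparison) {
    my $matches = $count{$comp} // 0;    # avoid an "uninitialized" warning
    print $output "$comp\t$matches\n";
}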
close $output;

Just to get you started... If someone wants to further work on this: these were my test files:

title_Nov19.txt

This is the foo title
Wow, we have bar too
Nothing special here but foo
OMG, the last title! And Foo again!

input.txt

foo
bar

And the result of the program was written to res.txt:

foo 3
bar 1

Upvotes: 0
