How can I match data from two large files in Perl?

Question

I have 2 (large) files. The first one is about 200k lines, the second one about 30 millions lines.

I want to check if each line of the first one is in the second one using Perl. Is it faster to compare directly each line of the first to each line of the second or is it better to store them all in two different arrays and then manipulate arrays?

beasy · Accepted Answer

Store the first file's lines in a hash, then iterate through the second file without storing it in memory.

It might be counterintuitive to store the first file and iterate the second file as opposed to vice-versa, but it allows you to avoid creating a 30 million element hash.

use feature 'say';

my ($path_1, $path_2) = @ARGV;

open my $fh1,"<",$path_1;
my %f1;
$f1{$_} = $. while (<$fh1>);

open my $fh2,"<",$path_2;
while (<$fh2>) {
    if (my $f1_line = $f1{$_}) {
        say "file 1 line $f1_line appears in file 2 line $.";  
    }
}

Note that without further processing, the duplicated lines will display in the order they appear in the second file, not first.

Also, this assumes file 1 does not have duplicate lines, but that can be handled if necessary.

How can I match data from two large files in Perl?

Answers (2)

Related Questions