VeZoul

Reputation: 510

How can I match data from two large files in Perl?

I have two (large) files. The first one is about 200k lines, the second one about 30 million lines.

I want to check, using Perl, whether each line of the first file appears in the second. Is it faster to compare each line of the first file directly against each line of the second, or is it better to read both files into arrays and then work on the arrays?

Upvotes: 0

Views: 114

Answers (2)

beasy

Reputation: 1227

Store the first file's lines in a hash, then iterate through the second file without storing it in memory.

It might be counterintuitive to store the first file and iterate over the second rather than vice versa, but it lets you avoid building a 30-million-element hash.

use strict;
use warnings;
use feature 'say';

my ($path_1, $path_2) = @ARGV;

# store the smaller file in a hash: line => line number
open my $fh1, "<", $path_1 or die "Can't open $path_1: $!";
my %f1;
while (<$fh1>) {
    chomp;    # so a missing trailing newline can't break matching
    $f1{$_} = $.;
}
close $fh1;

# stream the larger file; it is never held in memory
open my $fh2, "<", $path_2 or die "Can't open $path_2: $!";
while (<$fh2>) {
    chomp;
    if (my $f1_line = $f1{$_}) {
        say "file 1 line $f1_line appears in file 2 line $.";
    }
}

Note that without further processing, the matched lines will be reported in the order they appear in the second file, not the first.

Also, this assumes file 1 does not have duplicate lines, but that can be handled if necessary (see the sketch below).
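For example, here is a minimal variation of the code above that tolerates duplicates in file 1 by keeping every line number per distinct line (the array-of-line-numbers idea is mine, not part of the original answer):

use strict;
use warnings;
use feature 'say';

my ($path_1, $path_2) = @ARGV;

open my $fh1, "<", $path_1 or die "Can't open $path_1: $!";
my %f1;
while (<$fh1>) {
    chomp;
    push @{ $f1{$_} }, $.;   # remember every occurrence, not just the last
}
close $fh1;

open my $fh2, "<", $path_2 or die "Can't open $path_2: $!";
while (<$fh2>) {
    chomp;
    if (my $f1_lines = $f1{$_}) {
        say "file 1 line(s) @$f1_lines appear in file 2 line $.";
    }
}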

Upvotes: 1

Sinan Ünür

Reputation: 118128

You have File A and File B. You want to check if lines in File A appear in File B.

If you have enough memory to hold the contents of File B in a hash using one entry per line, that's the simplest. Go ahead.

However, if you do not, I recommend you put both files into tables in an SQL database; SQLite might be enough to start with. Then your problem is reduced to a simple JOIN.

If line length is an issue, use a fast hash such as xxHash. If implemented correctly, the 64-bit version is blazing fast on a 64-bit machine, especially if you enabled optimizations when building your Perl. Store two columns: the hash and the actual line. If the hashes match, check whether the lines match as well. Make sure to index on the hash column.
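A rough sketch of that hash-plus-line table, assuming DBI with DBD::SQLite and the Digest::xxHash CPAN module (all file, table, and column names here are illustrative, not from the answer):

use strict;
use warnings;
use DBI;
use Digest::xxHash qw(xxhash64);   # assumed CPAN module; any fast 64-bit hash will do

my $dbh = DBI->connect("dbi:SQLite:dbname=lines.db", "", "",
    { RaiseError => 1, AutoCommit => 0 });

$dbh->do("CREATE TABLE file_b (h INTEGER, line TEXT)");

my $ins = $dbh->prepare("INSERT INTO file_b (h, line) VALUES (?, ?)");
open my $fb, "<", "file_b.txt" or die "Can't open file_b.txt: $!";
while (my $line = <$fb>) {
    chomp $line;
    $ins->execute(xxhash64($line, 0), $line);
}
$dbh->commit;

# index the hash column so the equality lookups below are cheap
$dbh->do("CREATE INDEX idx_file_b_h ON file_b (h)");
$dbh->commit;

# the hash narrows the search; comparing the line itself rules out collisions
my $sel = $dbh->prepare("SELECT 1 FROM file_b WHERE h = ? AND line = ?");
open my $fa, "<", "file_a.txt" or die "Can't open file_a.txt: $!";
while (my $line = <$fa>) {
    chomp $line;
    $sel->execute(xxhash64($line, 0), $line);
    print "found: $line\n" if $sel->fetchrow_array;
    $sel->finish;
}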

You say:

In fact, my files are like:
File A: name number (per line)
File B: name date location number (per line)
And I have to check if File B contains lines matching the data of File A (ignoring date and location, for example). So it's not an exact match ...

In that case, you are set. You do not even have to worry about the hash stuff (which I am leaving here for reference). Put the interesting bits of data on which you need to match into separate columns in an SQLite database. Write a join. ... Profit.
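A sketch of that reduced problem, with the schema taken from the comment above (whitespace-separated fields; all file, table, and column names are my assumptions):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=match.db", "", "",
    { RaiseError => 1, AutoCommit => 0 });

# File A: name number    File B: name date location number
$dbh->do("CREATE TABLE a (name TEXT, number TEXT)");
$dbh->do("CREATE TABLE b (name TEXT, date TEXT, location TEXT, number TEXT)");

my $ins_a = $dbh->prepare("INSERT INTO a VALUES (?, ?)");
open my $fa, "<", "file_a.txt" or die "Can't open file_a.txt: $!";
while (<$fa>) {
    chomp;
    $ins_a->execute(split ' ', $_, 2);
}

my $ins_b = $dbh->prepare("INSERT INTO b VALUES (?, ?, ?, ?)");
open my $fb, "<", "file_b.txt" or die "Can't open file_b.txt: $!";
while (<$fb>) {
    chomp;
    $ins_b->execute(split ' ', $_, 4);
}
$dbh->commit;

# index the join columns so the query doesn't scan all of b per row of a
$dbh->do("CREATE INDEX idx_b ON b (name, number)");
$dbh->commit;

# lines of A whose (name, number) pair appears somewhere in B
my $rows = $dbh->selectall_arrayref(
    "SELECT DISTINCT a.name, a.number
     FROM a JOIN b ON a.name = b.name AND a.number = b.number"
);
print "@$_\n" for @$rows;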

Alternatively, you could use BerkeleyDB, which gives you the conceptual simplicity of an in-memory hash while storing the table on disk. If you have multiple attributes on which to match, this will not scale well.
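A minimal sketch of that tied-hash idea, here using the core DB_File module to tie a Perl hash to an on-disk Berkeley DB file (file names are mine):

use strict;
use warnings;
use DB_File;
use Fcntl;

# hash syntax in Perl, storage on disk
tie my %seen, 'DB_File', 'file_b.db', O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie file_b.db: $!";

open my $fb, "<", "file_b.txt" or die "Can't open file_b.txt: $!";
while (<$fb>) {
    chomp;
    $seen{$_} = 1;
}

open my $fa, "<", "file_a.txt" or die "Can't open file_a.txt: $!";
while (<$fa>) {
    chomp;
    print "found: $_\n" if exists $seen{$_};
}

untie %seen;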

Upvotes: 7
