Reputation: 94820

How to sort rows in a text file in Perl?

I have a couple of text files (A.txt and B.txt) which look like this (might have ~10000 rows each)

processa,id1=123,id2=5321
processa,id1=432,id2=3721
processa,id1=3,id2=521
processb,id1=9822,id2=521
processa,id1=213,id2=1
processc,id1=822,id2=521

I need to check if every row in file A.txt is present in B.txt as well (B.txt might have more too, that is okay).

The thing is that rows can be in any order in the two files, so I am thinking I will sort them in some particular order in both the files in O(nlogn) and then match each line in A.txt to the next lines in B.txt in O(n). I could implement a hash, but the files are big and this comparison happens only once after which these files are regenerated, so I don't think that is a good idea.

What is the best way to sort the files in Perl? Any ordering would do, it just needs to be some ordering.

For example, in dictionary ordering, this would be

processa,id1=123,id2=5321
processa,id1=213,id2=1
processa,id1=3,id2=521
processa,id1=432,id2=3721
processb,id1=9822,id2=521
processc,id1=822,id2=521

As I mentioned before, any ordering would be just as fine, as long as Perl is fast in doing it.

I want to do it from within Perl code, after opening the file like so

open (FH, "<A.txt");

Any comments, ideas etc would be helpful.

Upvotes: 3

Answers (6)

HerbN

Reputation: 1509

Test if A.txt is a subset of B.txt

open FILE.B, "B.txt";
open FILE.A, "A.txt";

my %bFile;

while(<FILE.B>) {
   ($process, $id1, $id2) = split /,/;
   $bFile{$process}{$id1}{$id2}++;
}

$missingRows = 0;

while(<FILE.A>) {
   $missingRows++ unless $bFile{$process}{$id1}{$id2};
   # If we've seen a given entry already don't add it
   next if $missingRows; # One miss means they aren't all verified
}

$is_Atxt_Subset_Btxt = $missingRows?FALSE:TRUE;

That will give you a test for all rows in A being in B with only reading in all of B and then testing each member of the array while reading A.

Upvotes: 0

FMc

Reputation: 42411

Here's another way to do it. The idea is to create a flexible data structure that allows you to answer many kinds of questions easily with grep.

use strict;
use warnings;

my ($fileA, $fileB) = @ARGV;

# Load all lines: $h{LINE}{FILE_NAME} = TALLY
my %h;
$h{$_}{$ARGV} ++ while <>;

# Do whatever you need.
my @all_lines = keys %h;
my @in_both   = grep {     keys %{$h{$_}} == 2       } keys %h;
my @in_A      = grep {     exists $h{$_}{$fileA}     } keys %h;
my @only_in_A = grep { not exists $h{$_}{$fileB}     } @in_A;
my @in_A_mult = grep {            $h{$_}{$fileA} > 1 } @in_A;

Upvotes: 1

cjm

Reputation: 62099

As usual, CPAN has an answer for this. Either Sort::External or File::Sort looks like it would work. I've never had occasion to try either, so I don't know which would be better for you.

Another possibility would be to use AnyDBM_File to create a disk-based hash that can exceed available memory. Without trying it, I couldn't say whether using a DBM file would be faster or slower than the sort, but the code would probably be simpler.

Upvotes: 0

DVK

Reputation: 129403

May I ask why you must do this in native Perl? If the cost of calling a system call or 3 is not an issue (e.g. you do this infrequently and not in a tight loop), why not simply do:

my $cmd = "sort $file1 > $file1.sorted";
$cmd .= "; sort $file2 > $file2.sorted";
$cmd .= "; comm -23 $file1.sorted $file2.sorted |wc -l";
my $count = `$cmd`;
$count =~ s/\s+//g;
if ($count != 0) {
    print "Stuff in A exists that aren't in B\n";
}

Please note that comm parameter might be different, depending on what exactly you want.

Upvotes: 0

zigdon

Reputation: 15063

To sort the file in your script, you will still have to load the entire thing into memory. If you're doing that, I'm not sure what's the advantage of sorting it vs just loading it into a hash?

Something like this would work:

my %seen;
open(A, "<A.txt") or die "Can't read A: $!";
while (<A>) {
    $seen{$_}=1;
}
close A;

open(B, "<B.txt") or die "Can't read B: $!";
while(<B>) {
  delete $seen{$_};
}
close B;

print "Lines found in A, missing in B:\n";
join "\n", keys %seen;

Upvotes: 5

recursive9

Reputation: 340

well, i routinely parse very large (600MB) daily Apache log files with Perl, and to store the information i use a hash. I also go through about 30 of these files, in one script instance, using the same hash. its not a big deal assuming you have enough RAM.

Upvotes: 0

How to sort rows in a text file in Perl?

Answers (6)

Related Questions