I'm wondering if there is a more efficient way of doing what I am attempting. I need to read a file and compare each line with all the lines that follow it (i.e. compare line 1 with lines 2, 3, 4, 5, ...; line 2 with 3, 4, 5, ...; line 3 with 4, 5, ...; and so on). I feel some of the files are too large to read entirely into @lists, so right now I open two file handles to the same file and use tell and seek to get the position of one handle and set the position of the other, something like this:
open FH1, '/arf/test.txt';
open FH2, '/arf/test.txt';
while ($outer = <FH1>){
    chomp($outer);
    $fh1Posn = tell FH1;
    seek FH2, $fh1Posn, 0;
    while ($inner = <FH2>){
        [code to compare line];
    }
}
This works absolutely fine on small files, but is a drag with some of the larger files I need to process (even when I substitute dummy code for the comparison portion). Perhaps this is to be expected, but I'm wondering if there is anything I could try to speed up just the reading of the file. I was thinking of using a single file handle, eliminating the second, and using $fh1Posn at the bottom of the while loop to reset the cursor back to where it was at the top, something like:
open FH1, '/arf/test.txt';
while ($outer = <FH1>){
    chomp($outer);
    $fh1Posn = tell FH1;
    while ($inner = <FH1>){
        [code to compare line];
    }
    seek FH1, $fh1Posn, 0;
}
I haven't tried this yet (I will), but I suspect it won't make a difference worth the trouble.
The fastest solution would be to do it entirely in memory, limiting your reading of the disk to just a single pass. If you don't have enough memory for that, though, then I suggest splitting the work into chunks.
Your current solution re-reads the remainder of the file for every line, which amounts to O(n^2) line reads in the worst case when memory is tight. If you do it in groups of 1000 lines, you can reduce your disk operations considerably. You'd have to experiment to find where your real memory limit is, of course.
my $buffersize = 1000;
open FH1, '/arf/test.txt';
open FH2, '/arf/test.txt';
while (! eof(FH1)) {
    my @buffer = ();
    for (1..$buffersize) {
        my $outer = <FH1>;
        chomp $outer;
        push @buffer, $outer;
        last if eof(FH1);
    }
    # Seek code here
    while ($inner = <FH2>){
        for (@buffer) {
            [code to compare line];
        }
    }
}
Addendum on Friday, Feb 25th - More Detailed Code Solution
I had some additional time at the end of the day, so I coded up a version of this that works on a file named test.txt containing the numbers 1 through 10, one per line. I set the buffer size to just 4, so the script does 4 passes over the file instead of the 11 your original method would make.
use strict;
use warnings;
use autodie;

my $buffersize = 4;          # 1000;
my $file       = 'test.txt'; # '/arf/test.txt';

open my $fh1, '<', $file;
open my $fh2, '<', $file;

while (! eof($fh1)) {
    # Move fh2 to start of buffer
    my $startofbuffer = tell($fh1);
    seek($fh2, $startofbuffer, 0);

    # Build buffer of entries to compare
    my @buffer = ();
    for (1..$buffersize) {
        chomp(my $outer = <$fh1>);
        print "buffer, $.\n";
        push @buffer, $outer;
        last if eof($fh1);
    }

    # Keep track of relative offset to start of buffer
    for (my $offset = 0; !eof($fh2); $offset++) {
        chomp(my $inner = <$fh2>);
        for my $i (0..$#buffer) {
            last if $i >= $offset;
            my $outer = $buffer[$i];
            print "Compare outer $outer to inner $inner\n";
        }
    }
}
The output of this is as follows:
buffer, 1
buffer, 2
buffer, 3
buffer, 4
Compare outer 1 to inner 2
Compare outer 1 to inner 3
Compare outer 2 to inner 3
Compare outer 1 to inner 4
Compare outer 2 to inner 4
Compare outer 3 to inner 4
Compare outer 1 to inner 5
Compare outer 2 to inner 5
Compare outer 3 to inner 5
Compare outer 4 to inner 5
Compare outer 1 to inner 6
Compare outer 2 to inner 6
Compare outer 3 to inner 6
Compare outer 4 to inner 6
Compare outer 1 to inner 7
Compare outer 2 to inner 7
Compare outer 3 to inner 7
Compare outer 4 to inner 7
Compare outer 1 to inner 8
Compare outer 2 to inner 8
Compare outer 3 to inner 8
Compare outer 4 to inner 8
Compare outer 1 to inner 9
Compare outer 2 to inner 9
Compare outer 3 to inner 9
Compare outer 4 to inner 9
Compare outer 1 to inner 10
Compare outer 2 to inner 10
Compare outer 3 to inner 10
Compare outer 4 to inner 10
buffer, 5
buffer, 6
buffer, 7
buffer, 8
Compare outer 5 to inner 6
Compare outer 5 to inner 7
Compare outer 6 to inner 7
Compare outer 5 to inner 8
Compare outer 6 to inner 8
Compare outer 7 to inner 8
Compare outer 5 to inner 9
Compare outer 6 to inner 9
Compare outer 7 to inner 9
Compare outer 8 to inner 9
Compare outer 5 to inner 10
Compare outer 6 to inner 10
Compare outer 7 to inner 10
Compare outer 8 to inner 10
buffer, 9
buffer, 10
Compare outer 9 to inner 10
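Note that those 45 "Compare" lines are exactly the 10 * 9 / 2 = 45 unique pairs, so no pair is repeated or skipped across buffer boundaries.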
Addendum on Sunday, Mar 2nd - Benchmarks
Your report of no improvement in speed made me a little curious, so I created a little script to make some fake data:
use strict;
use warnings;
use autodie;

my $lines = shift or die "Missing line count\n";
die "Lines outside of range 1 - 1,000,000" if $lines < 1 or $lines > 1_000_000;

my $fakedata = 70;
my $filename = 'fd' . sprintf("%06d", $lines) . '.txt';

open my $fh, '>', $filename;
for my $i (1..$lines) {
    my $fake = join '', map {('a'..'z')[int rand 26]} (1..$fakedata);
    $fh->print("$i fake${fake}fake\n");
}
close $fh;

print "Created $filename\n";

1;

__END__
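For reference, running it with a line count argument (for example, perl fakedata.pl 1000, assuming the script is saved as fakedata.pl) prints Created fd001000.txt and produces a file of 1,000 numbered lines, roughly 88k in size.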
I then edited the detailed code I provided above so that it didn't output any debugging statements and instead did a very basic comparison. This led to the following results for files with line counts of 1_000, 10_000, and 20_000.
For Buffer size, show Time in sec and (# of File Reads)
---------------------------------------------------------------------------
File by lines   b = 1        b = 10       b = 100      b = 1k      b = 10k
-------------   -----        ------       -------      ------      -------
Lines = 1k      t = 1.54s    t = 0.35s    t = 0.22s    t = 0.21s
Size  = 88k     r = (1001)   r = (101)    r = (11)     r = (2)
Tests = 500k

Lines = 10k     t = 185s     t = 35s      t = 23s      t = 21.7s   t = 21.5s
Size  = 899k    r = (10k)    r = (1k)     r = (101)    r = (11)    r = (2)
Tests = 50m

Lines = 20k     t = 593s     t = 136s     t = 90s      t = 86s     t = 85.5s
Size  = 1.8m    r = (20k)    r = (2k)     r = (201)    r = (21)    r = (3)
Tests = 200m
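The Tests row is just N * N / 2 (e.g. 20,000**2 / 2 = 200 million), and the read counts follow directly from the buffering: one pass with the outer handle plus one rescan with the inner handle per buffer-full of lines, i.e. roughly n/b + 1.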
As you can see, buffering even just 100 lines brought the script time down by a factor of 5 or more. These scripts still take a long time, though, because the number of comparisons is equal to N * N / 2. That's 200 million tests for the 20k file, which is why that file takes about 4 times as long as the 10k file.
If these values hold, your 250k-line file would take 156.25 times longer than the 20k file. That equates to 3.71 hours in the best case with buffering, or 25.7 hours without any buffering at all. These numbers also assume the absolute minimum time per comparison, and I assume yours is probably more complicated than a simple odd/even test.
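To spell that extrapolation out: (250,000 / 20,000)^2 = 12.5^2 = 156.25, so 156.25 * 85.5s is roughly 13,360s (3.71 hours) with the largest buffer, and 156.25 * 593s is roughly 92,650s (25.7 hours) with no buffering.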
Unfortunately, you don't say anything about the nature of your project or these comparisons, so I can't really conjecture about other possible speed improvements. If your goal were actually something in the nature of sorting, it would be possible to reduce the work to O(n log n) instead of O(n**2). For sorting I'd advise breaking the file into groups that fit into memory, using Perl's sort on each group, and then using a merge sort to combine the sorted groups. I'll stop at the rough sketch below, though, since it's only conjecture that that's what you're up to.
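Purely as an illustration in case sorting really is your goal, here is a minimal sketch of that split/sort/merge idea. The input file name, chunk size, output file name, and the plain lexical cmp comparison are all assumptions on my part, so treat it as an outline rather than a drop-in solution:

use strict;
use warnings;
use autodie;

my $infile    = 'test.txt';   # assumed input name, as in the examples above
my $chunksize = 100_000;      # lines per in-memory chunk; tune to your memory

# Pass 1: read the input in chunks, sort each chunk in memory, write it out
my @chunkfiles;
open my $in, '<', $infile;
while (! eof($in)) {
    my @lines;
    while (@lines < $chunksize && ! eof($in)) {
        chomp(my $line = <$in>);
        push @lines, $line;
    }
    my $chunkfile = sprintf 'chunk%03d.tmp', scalar @chunkfiles;
    open my $out, '>', $chunkfile;
    $out->print("$_\n") for sort @lines;   # plain lexical sort; substitute your own
    close $out;
    push @chunkfiles, $chunkfile;
}
close $in;

# Pass 2: merge the sorted chunk files into a single sorted output,
# holding only one "head" line per chunk in memory at a time
my @readers = map { open my $fh, '<', $_; $fh } @chunkfiles;
my @heads   = map { my $line = readline $_; chomp $line if defined $line; $line } @readers;
open my $merged, '>', 'sorted.txt';
while (grep { defined } @heads) {
    # pick whichever reader currently has the smallest line (linear scan
    # is fine for a modest number of chunks)
    my ($best) = sort { $heads[$a] cmp $heads[$b] }
                 grep { defined $heads[$_] } 0 .. $#heads;
    $merged->print("$heads[$best]\n");
    my $next = readline $readers[$best];
    chomp $next if defined $next;
    $heads[$best] = $next;
}
close $merged;
unlink @chunkfiles;   # clean up the temporary chunk files

Each chunk is sorted independently in memory, and the merge step only ever holds one line per chunk, so the memory footprint stays bounded by the chunk size no matter how large the original file is.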
Anyway, good luck with your project.