user001

Reputation: 1848

Efficiently reading N lines at a time from an input file in Perl

The structure of my input data file is such that it is more logical to read the data in blocks of N lines rather than a single line at a time. Of course, I could use something straightforward like

my @lines = ();
while ( !eof($FH) ) {
    for ( my $i = 0; $i < $N; $i++ ) {
        $lines[$i] = <$FH>;
        chomp $lines[$i];
    }
    # proceed with analysis of N-size block
}

Because the input files are very large (GB), however, I wonder whether there is a more efficient solution than a for loop. For instance, I found another solution online that uses the map function, although when I try to implement it in my script, it results in an error ("my" variable @lines masks earlier declaration in same statement):

while(( my @lines = map $_ = <>, 1 .. 4 )[0]) {
  print @lines;
  print "\n";
}

Admittedly, I don't understand the significance of the [0] subscript in the while condition of this code, and another solution I found suggested using [-1] instead.

Given the I/O intensiveness of the operation, I wonder what the most computationally efficient solution to this problem would be (within the bounds of the Perl programming language).

Upvotes: 1

Views: 2029

Answers (2)

Borodin

Reputation: 126722

By far the biggest bottleneck in any file I/O is the disk itself. Perl reads the file in large chunks and searches through them for newlines so that it can hand you the data one line at a time. That means any scheme for grouping lines into blocks will cost only a tiny fraction of the time it takes to read the next chunk from disk. So, as usual, the most important criterion is how readable the code is.
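
If you want to verify that on your own data, the core Benchmark module makes a rough comparison straightforward. A minimal sketch (the file name input.txt and the block size of 4 are placeholders, not part of the question):

use strict;
use warnings;

use Benchmark qw(cmpthese);

# Compare plain line-at-a-time reading with a map-based block read.
# 'input.txt' stands in for a real data file.
cmpthese(-3, {
    per_line => sub {
        open my $fh, '<', 'input.txt' or die $!;
        while (my $line = <$fh>) { }
        close $fh;
    },
    block_map => sub {
        open my $fh, '<', 'input.txt' or die $!;
        while (my @block = grep defined, map { scalar <$fh> } 1 .. 4) { }
        close $fh;
    },
});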

Once I started coding, I could see why the most obvious solution would be a map. Unfortunately it would look like this:

use strict;
use warnings;

use Data::Dump;

use constant N => 4;

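# Read up to N lines per pass; at end of file <DATA> returns undef,
# which the grep strips out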
while (my @block = grep defined, map { scalar <DATA> } 1 .. N) {
  dd \@block;
}

__DATA__
1
2
3
4
5
6
7
8
9

output

["1\n", "2\n", "3\n", "4\n"]
["5\n", "6\n", "7\n", "8\n"]
["9\n"]

But it can be written more cleanly. So far I like this best:

use strict;
use warnings;

use Data::Dump;

use constant N => 4;

until (eof DATA) {
  my ($rec, @block);
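  # append lines until the block holds N of them or the file runs out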
  push @block, $rec while @block < N and $rec = <DATA>;
  dd \@block;
}

__DATA__
1
2
3
4
5
6
7
8
9

which has identical output.

I'm thinking about something like

while (do { ... }) {
   dd \@block;
}

but I'm not there yet!
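
One hypothetical way to fill in that do block, offered purely as a sketch in the same spirit (same N, DATA, and dd as above):

# sketch only: build the block inside the do, return it as the condition
while (my @block = do {
    my @b;
    push @b, $_ while @b < N and defined($_ = <DATA>);
    @b;
}) {
    dd \@block;
}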

Upvotes: 3

Miller

Reputation: 35198

For simplicity, I would probably advise reading from the main while loop and adding to a buffer:

my @buffer;

while (<$FH>) {
    push @buffer, $_;

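    # flush when the buffer is full or we have just read the last line;
    # bare eof tests the handle last read from, here $FH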
    if (@buffer == $N || eof) {
        print @buffer;
        @buffer = ();
    }
}
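
If the block handling is needed in more than one place, the same idea can be wrapped in a small helper. This is only an illustrative sketch; the name read_block and its interface are mine, and $FH and $N are assumed from the question:

# Hypothetical helper: return the next block of up to $n lines,
# or an empty list at end of file
sub read_block {
    my ($fh, $n) = @_;
    my @block;
    while (@block < $n) {
        my $line = <$fh>;
        last unless defined $line;
        push @block, $line;
    }
    return @block;
}

while (my @block = read_block($FH, $N)) {
    print @block;
}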

Algorithmically, I don't expect any particular method to be significantly faster than any other. You could try playing around with other methods of reading from the filehandle, but ultimately, I would not expect to find any major speed improvements.

Upvotes: 5
