Reputation: 1848
The structure of my input data file is such that it is more logical to read the data in blocks of N
lines rather than a single line at a time. Of course, I could use something straightforward like
my @lines = ();
while ( !eof($FH) ) {
    for ( my $i = 0; $i < $N; $i++ ) {
        $lines[$i] = <$FH>;
        chomp $lines[$i];
    }
    # proceed with analysis of N-size block ##
}
Because the input files are very large (gigabytes), however, I wonder whether there is a more efficient solution than a for loop. For instance, I found another solution online that uses the map function, although when I try to implement it in my script it fails with an error ("my" variable @lines masks earlier declaration in same statement):
while ( ( my @lines = map $_ = <>, 1 .. 4 )[0] ) {
    print @lines;
    print "\n";
}
Admittedly, I don't understand the significance of the [0] in the while condition of this code, and another solution suggested using [-1] instead.
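From what I can gather, the trailing [0] is a list slice: the parenthesised assignment yields the list of lines just read, and [0] picks out its first element, so the loop continues as long as the first line of the block was actually read ([-1] would test the last line instead). A minimal sketch of the construct in isolation, assuming I'm reading the syntax right:
# (LIST)[INDEX] is a list slice on a parenthesised list
my $first = ( "a\n", "b\n", "c\n" )[0];     # "a\n"
my $last  = ( "a\n", "b\n", "c\n" )[-1];    # "c\n"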
Given how I/O-intensive the operation is, I wonder what the most computationally efficient solution to this problem would be (within the bounds of the Perl programming language).
Upvotes: 1
Views: 2029
Reputation: 126722
By far the biggest bottleneck in any file I/O is the disk itself. Perl reads the file in arbitrarily large chunks and searches through them for newlines so that it can hand the data to you one line at a time. That means any scheme for reading multiple lines at a time will take only a tiny fraction of the time needed to fetch the next chunk from disk. So, as usual, the most important criterion is how readable the code is.
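As an aside that illustrates the point, perl can even hand you fixed-size records instead of lines if you set $/ to a reference to an integer. A sketch, reusing the question's $FH; the 64 KB figure is arbitrary:
local $/ = \65536;    # fixed-length record mode: each read returns up to 64 KB
while ( my $chunk = <$FH> ) {
    # $chunk is raw data, not split on newlines
}
For line-oriented work, though, the buffering that perl already does underneath the readline makes this unnecessary.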
Once I started coding I could see why the most obvious solution would be a map. Unfortunately, it would look like this:
use strict;
use warnings;

use Data::Dump;
use constant N => 4;

# grep defined drops the trailing undefs that map collects once
# <DATA> is exhausted, so the loop ends on the first empty block
while ( my @block = grep defined, map { scalar <DATA> } 1 .. N ) {
    dd \@block;
}
__DATA__
1
2
3
4
5
6
7
8
9
output
["1\n", "2\n", "3\n", "4\n"]
["5\n", "6\n", "7\n", "8\n"]
["9\n"]
But it can be written more cleanly. So far I like this the best
use strict;
use warnings;

use Data::Dump;
use constant N => 4;

until ( eof DATA ) {
    my ( $rec, @block );
    # keep pushing lines until the block holds N of them or the read fails
    push @block, $rec while @block < N and $rec = <DATA>;
    dd \@block;
}
__DATA__
1
2
3
4
5
6
7
8
9
which has identical output.
I'm thinking about something like
while ( do { ... } ) {
    dd \@block;
}
but I'm not there yet!
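One possible completion of that idea (a sketch rather than a polished version, reusing N and dd from above):
while ( my @block = do {
    my ( $rec, @b );
    push @b, $rec while @b < N and $rec = <DATA>;
    @b;
} ) {
    dd \@block;
}
The do block returns the lines it collected; assigning them to @block in the while condition yields the number of elements, so the loop stops on the first empty block.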
Upvotes: 3
Reputation: 35198
For simplicity, I would probably advise reading from the main while loop and adding to a buffer:
my @buffer;
while (<$FH>) {
    push @buffer, $_;
    # flush when the buffer holds $N lines or the file ends
    if ( @buffer == $N || eof ) {
        print @buffer;
        @buffer = ();
    }
}
Algorithmically, I don't expect any particular method to be significantly faster than another. You could experiment with other ways of reading from the filehandle, but ultimately I would not expect to find any major speed improvements.
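If the grouping logic will be reused, it could also be factored into a small helper sub. This is a sketch of my own (the read_block name is made up), building on the same buffering idea:
sub read_block {
    my ( $fh, $n ) = @_;
    my @block;
    while ( @block < $n ) {
        my $line = <$fh>;
        last unless defined $line;    # stop at end of file
        push @block, $line;
    }
    return @block;
}

while ( my @block = read_block( $FH, $N ) ) {
    print @block;
}
The list assignment in the while condition evaluates to the number of lines returned, so the loop ends cleanly when read_block comes back empty.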
Upvotes: 5