Reputation: 364
I have to split a large 1.8 TB text file in two (I need only the second half of the file). The file uses \n as the record separator.
I tried
perl -ne 'print if $. >= $line_to_start_from' test.txt > result.txt
on a much smaller 115 MB test file. It did the job but took 22 seconds.
Using this solution on a 1.8 TB file will take an unreasonably long time, so my question is: is there a way in Perl to split huge files without looping over them?
Upvotes: 3
Views: 787
Reputation: 4445
By default perl processes file input one line at a time. If your file contains lots of relatively short lines (and I'm assuming it does), perl will be a lot slower than utilities like split, which work through the file in bigger chunks.
For testing, I created a ~200MB file with very short lines:
$ perl -e 'print "123\n" for( 1 .. 50_000_000 );' >file_to_split
split can handle it pretty reasonably:
$ time split --lines=25000000 file_to_split half
real 0m1.266s
user 0m0.314s
sys 0m0.213s
And the naïve perl approach is much slower:
$ time perl -ne 'print if $. > 25_000_000' file_to_split >second_half
real 0m10.474s
user 0m10.257s
sys 0m0.222s
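But you can use the $/ special variable to make perl read more than one line at a time. Setting it to a reference to an integer makes each read return a fixed-size record of that many bytes instead of a line. For example, 16 KB of data at a time: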
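my $CHUNK_SIZE    = 16 * 1024;     # bytes to read per record
my $SPLIT_AT_LINE = 25_000_000;    # number of leading lines to drop
{
    # a reference to an integer makes <> return fixed-size records
    local $/ = \$CHUNK_SIZE;
    my $lineNumber = 0;            # newlines seen so far
    while ( <> ) {
        # >= (not >) so a split that falls exactly on a chunk boundary
        # does not drop a following chunk that contains no newline
        if ( $lineNumber >= $SPLIT_AT_LINE ) {
            # everything from here on is in the second half
            print $_;
        }
        else {
            my $count = $_ =~ tr/\n/\n/;   # newlines in this chunk
            $lineNumber += $count;
            if ( $lineNumber >= $SPLIT_AT_LINE ) {
                # we reached the split inside this chunk: keep only the text
                # after the newline that ends line $SPLIT_AT_LINE
                my $extra = $lineNumber - $SPLIT_AT_LINE;
                my @lines = split m/\n/, $_, $count - $extra + 1;
                print $lines[ -1 ];
            }
        }
    }
}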
If you don't care about overshooting the split by a few lines, you could make this code even simpler. And this gets perl to do the same operation in a reasonable amount of time:
$ time perl test.pl file_to_split >second_half
real 0m0.678s
user 0m0.095s
sys 0m0.297s
Upvotes: 4