Aditya

Reputation: 364

Splitting large text files with Perl

I have to split a large, 1.8 TB text file in two (I need only the second half of the file). The file uses \n as the record separator.

I tried

perl -ne 'print if $. >= $line_to_start_from' test.txt > result.txt 

on a much smaller, 115 MB test file and it did the job, but took 22 seconds.

Using this solution for a 1.8 TB file would take an unreasonably long time, so my question is whether there is a way in Perl to split huge files without looping over them?

Upvotes: 3

Views: 787

Answers (1)

AKHolland

Reputation: 4445

By default, perl reads input one line at a time. If your file contains lots of relatively short lines (and I'm assuming it does), perl will be a lot slower than utilities like split, which read from the file in much larger chunks.

For testing, I created a ~200 MB file with very short lines:

$ perl -e 'print "123\n" for( 1 .. 50_000_000 );' >file_to_split

split can handle it pretty reasonably:

$ time split --lines=25000000 file_to_split half

real    0m1.266s
user    0m0.314s
sys     0m0.213s

And the naïve perl approach is much slower:

$ time perl -ne 'print if $. > 25_000_000' file_to_split >second_half

real    0m10.474s
user    0m10.257s
sys     0m0.222s

But you can set the $/ special variable to a reference to an integer, which makes perl read fixed-size chunks instead of lines. For example, 16 KB of data at a time:

my $CHUNK_SIZE = 16 * 1024;
my $SPLIT_AT_LINE = 25_000_000;

{
    local $/ = \$CHUNK_SIZE;    # a reference to an integer makes <> read fixed-size chunks
    my $lineNumber = 0;
    while ( <> ) {
        if ( $lineNumber > $SPLIT_AT_LINE ) {
            # everything from here on is in the second half
            print $_;
        }
        else {
            my $count = $_ =~ tr/\n/\n/;    # count the newlines in this chunk
            $lineNumber += $count;
            if ( $lineNumber > $SPLIT_AT_LINE ) {
                # we went past the split, get some of the lines from this buffer
                my $extra = $lineNumber - $SPLIT_AT_LINE;
                my @lines = split m/\n/, $_, $count - $extra + 1;
                print $lines[ -1 ];    # the last field is everything after the split point
            }
        }
    }
}

If you don't care about overshooting the split by a few lines, you could make this code even simpler (a sketch of that variant follows the timings below). Either way, this gets perl to do the same operation in a reasonable amount of time:

$ time perl test.pl file_to_split >second_half

real    0m0.678s
user    0m0.095s
sys     0m0.297s
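
For reference, here is a minimal sketch of that simpler variant (my own, not from the original script): it just counts newlines per chunk and starts printing at the chunk that crosses the threshold, so up to one chunk's worth of first-half lines can leak into the output:

my $CHUNK_SIZE = 16 * 1024;
my $SPLIT_AT_LINE = 25_000_000;

{
    local $/ = \$CHUNK_SIZE;    # read fixed-size chunks instead of lines
    my $lineNumber = 0;
    my $printing = 0;
    while ( <> ) {
        if ( $printing ) {
            print;
        }
        else {
            $lineNumber += tr/\n/\n/;    # count newlines in this chunk
            if ( $lineNumber > $SPLIT_AT_LINE ) {
                # crossed the threshold: print from here on, including the
                # first-half lines still left in this chunk (the overshoot)
                $printing = 1;
                print;
            }
        }
    }
}

Run it the same way, e.g. perl simpler.pl file_to_split >second_half (simpler.pl is just a placeholder name).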

Upvotes: 4
