Reputation: 1928
I have a huge file (nearly 50GB, just a matrix in ASCII made of 360K lines, each with 15K numbers), and I need to transpose it. In order to avoid reading the whole thing into memory, I wrote a Perl script that opens 15K files (one for each column of the matrix) and proceeds by reading a complete line of the input file and writing each number to the end of its corresponding file (the first number to output file column0.txt, the second number to output file column1.txt, etc.).
Things looked promising: the code only uses a constant 178MB of RAM, and the initial tests with only part of the input file ran perfectly: it processed 3600 lines in about a minute, so I was hoping to get the whole thing done in about two hours. But when I ran the real thing, the code kept stalling: at the beginning it processed ~4600 lines really quickly and then stopped for quite a while (perhaps 5-10 minutes) before continuing. Right now, after ~10 hours of calculation, it has processed 131K lines, and the code stops for two or three minutes after every 300-400 lines processed.
I have never worked with such big input files or so many open files, so I'm not sure whether the problem lies with the input or with the number of file descriptors. Any advice on how to diagnose (and hopefully solve) the speed problem? I include the relevant part of the program below.
Thanks
==================================
for ($i=0 ; $i<$columnas ; $i++) {
    $column[$i] = IO::File->new(">column$i.txt") or die $!;
}
while (<DATA>) {
    chomp;
    $cols = split;
    for ($col=0 ; $col<$cols ; $col++) {
        print { $column[$col] } "$_[$col] ";
    }
}
close (DATA) or die $!;
Upvotes: 4
Views: 575
Reputation: 67900
Some thoughts
1. Implicit split to @_
$cols = split;
Gives the warning:
Use of implicit split to @_ is deprecated
In case you are not already doing it, you should add
use warnings;
use strict;
to your script. (And heed those warnings.)
Consider changing $cols to @cols, and instead using $#cols in the for loop. E.g.
@cols = split;
for (my $col=0; $col <= $#cols; $col++)
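For concreteness, here is a rough sketch of how the whole corrected loop might look (matrix.txt is a stand-in name for the input file, and a lexical filehandle replaces the bareword DATA):

use strict;
use warnings;
use IO::File;

my $columnas = 15_000;                        # number of columns, per the question
my @column;
for my $i (0 .. $columnas - 1) {
    $column[$i] = IO::File->new(">column$i.txt") or die $!;
}

open my $in, '<', 'matrix.txt' or die $!;     # stand-in name for the 50GB input
while (<$in>) {
    my @cols = split;                         # explicit split into @cols
    for (my $col = 0; $col <= $#cols; $col++) {
        print { $column[$col] } "$cols[$col] ";
    }
}
close($in) or die $!;

Note this still keeps 15K handles open at once; the later points address that.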
2. No chomp required?
From split() in perlfunc:
If PATTERN is also omitted, splits on whitespace (after skipping any leading whitespace).
Which means your newline character should also be stripped, as it is counted as a whitespace character.
Therefore, chomp() is not required.
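A quick demonstration with a made-up line, showing that the trailing newline disappears without chomp:

my $line = "1 2 3\n";
my @fields = split ' ', $line;   # the special ' ' pattern behaves like a bare split
print scalar @fields, "\n";      # prints 3
print "<$fields[-1]>\n";         # prints <3> -- no newline attached to the last field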
3. Number of open files
I believe perl's open() is fairly fast, so it might be worth caching your data as weismat suggested. While you are doing that, you might as well share a single filehandle for all the files, and open them only while printing the cache. E.g.:
for ($i = 0; $i <= $#column; $i++) {
    open OUT, ">> column$i.txt" or die $!;
    print OUT $column[$i];
}
ETA: @column here contains the columns transposed from DATA. Instead of print, use:
$column[$col] .= $cols[$col] . " ";
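Putting the two ideas together, a minimal sketch of the cached version (the 500-line batch size is arbitrary, and matrix.txt is again a stand-in name for the input file):

use strict;
use warnings;

my @column;                                   # in-memory cache, one string per column

sub flush_cache {
    for my $i (0 .. $#column) {
        open my $out, '>>', "column$i.txt" or die $!;
        print $out $column[$i];
        close $out or die $!;
        $column[$i] = '';                     # reset this column's cache
    }
}

open my $in, '<', 'matrix.txt' or die $!;
while (<$in>) {
    my @cols = split;
    $column[$_] .= "$cols[$_] " for 0 .. $#cols;
    flush_cache() if $. % 500 == 0;           # append to disk every 500 lines
}
flush_cache();                                # write out whatever remains
close $in or die $!;

This keeps at most two filehandles open at any moment, at the cost of reopening each output file on every flush.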
Upvotes: 1
Reputation: 7411
Check /proc/sys/fs/file-max to see the maximum number of open files.
You may need to read the files using seek so that you can control the number of open files accordingly.
The best approach would be to cache x lines and then append them to all the files.
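If it helps, the limit can be read straight from Perl (Linux-specific path, as named above):

open my $fh, '<', '/proc/sys/fs/file-max' or die "Cannot read file-max: $!";
chomp(my $max = <$fh>);
close $fh;
print "system-wide file-max: $max\n";
# Note: the per-process limit (ulimit -n) is usually far lower, often 1024,
# which is well below the ~15K handles the script tries to keep open.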
Upvotes: 1
Reputation: 98388
Given the odd behavior you are seeing, checking that your prints succeed may be a good idea:
print { $column[$col] } "$_[$col] "
    or die "Error printing column $col: $!";
Try flushing every 500 lines or so? Add use IO::Handle; and, after the print:
if ( $. % 500 == 0 ) {
    $column[$col]->flush()
        or die "Flush of column $col failed: $!";
}
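For context, a sketch of how both suggestions might sit in the original loop (assuming @column holds the filehandles and $. counts input lines, as in the question):

use IO::Handle;                               # gives filehandles a flush() method

while (<DATA>) {
    my @cols = split;
    for (my $col = 0; $col <= $#cols; $col++) {
        print { $column[$col] } "$cols[$col] "
            or die "Error printing column $col: $!";
        if ( $. % 500 == 0 ) {
            $column[$col]->flush()
                or die "Flush of column $col failed: $!";
        }
    }
}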
Upvotes: 0