I have a very long file with numbers. Something like the output of this Perl program:
perl -le 'print int(rand() * 1000000) for 1..10'
but way longer - around hundreds of gigabytes.
I need to split this file into many others. For test purposes, let's assume 100 output files, where the output file number is the input number modulo 100.
With normal files, I can do it simply with:
perl -le 'print int(rand() * 1000000) for 1..1000' | awk '{z=$1%100; print > z}'
But I have a problem when I need to compress the split parts. Normally, I could:
... | awk '{z=$1%100; print | "gzip -c - > "z".txt.gz"}'
But when ulimit is configured to allow fewer open files than the number of "partitions", awk breaks with:
awk: (FILENAME=- FNR=30) fatal: can't open pipe `gzip -c - > 60.txt.gz' for output (Too many open files)
This doesn't break with normal file output, as GNU awk is apparently smart enough to recycle file handles.
Do you know any way (aside from writing my own stream-splitting program with buffering and some sort of pool-of-filehandles management) to handle such a case - that is: splitting into multiple files, where access to output files is random, and gzipping all output partitions on the fly?
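For reference, a brute-force sketch that stays under the handle limit (assuming GNU awk, and relying on the fact that concatenated gzip members form a valid gzip file) would be to close the pipe after every write and append:
... | awk '{z=$1%100; cmd="gzip -c >> " z ".txt.gz"; print | cmd; close(cmd)}'
But that spawns one gzip process per input line, which is far too slow for hundreds of gigabytes - hence the question.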
Upvotes: 1
Views: 271
I didn't write it in the question itself, but since the additional information belongs together with the solution, I'll put it all here.
So - the problem was on Solaris. Apparently there is a limitation that no program using stdio on Solaris can have more than 256 open filehandles?!
It is described here in detail. The important point is that it's enough to set one environment variable before running my problematic program, and the problem is gone:
export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
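With that variable set, the original pipeline runs unchanged. As a quick check (assuming gzip is on the PATH):
export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
perl -le 'print int(rand() * 1000000) for 1..1000' | awk '{z=$1%100; print | "gzip -c - > "z".txt.gz"}'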
Upvotes: 1