Reputation: 6511
My situation is the following: a big (10 GB) compressed file containing ~60 files with a total uncompressed size of 150 GB.
I would like to be able to slice big compressed log files into parts that contain a certain number of lines (e.g., 1 million).
I don't want to use split, since it involves fully decompressing the original file, and I don't have that much disk space available.
What I am doing so far is this:
#!/bin/bash
SAVED_IFS=$IFS
IFS=$(echo -en "\n\b")
for file in *.rar
do
    echo "Reading file: $file"
    > "$file.chunk.uncompressed"   # create (or empty) the staging file
    COUNTER=0
    CHUNK_COUNTER=$((10#000))
    unrar p "$file" | while read -r line
    do
        echo "$line" >> "$file.chunk.uncompressed"
        let COUNTER+=1
        if [ $COUNTER -eq 1000000 ]; then
            CHUNK_COUNTER=$(printf "%03d" $CHUNK_COUNTER)
            echo "Enough lines ($COUNTER) to create a compressed chunk ($file.chunk.compressed.$CHUNK_COUNTER.bz2)"
            pbzip2 -9 -c "$file.chunk.uncompressed" > "$file.chunk.compressed.$CHUNK_COUNTER.bz2"
            > "$file.chunk.uncompressed"   # empty the staging file, or the next chunk would repeat these lines
            # 10# forces bash to count in base 10, so that 008 and up are valid
            let CHUNK_COUNTER=$((10#$CHUNK_COUNTER+1))
            let COUNTER=0
        fi
    done
    #TODO need to compress lines in the last chunk too
done
IFS=$SAVED_IFS
What I don't like about it is that I am limited by the speed of writing and then re-reading the uncompressed chunks (~15 MB/s), while reading the uncompressed stream directly from the compressed file runs at ~80 MB/s.
How can I adapt this script so that it streams a limited number of lines per chunk and writes straight into a compressed file?
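(For reference, throughput figures like the ones above can be checked with pv, assuming it is installed; it reports the rate of data flowing through a pipe:)

$ unrar p "$file" | pv > /dev/null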
Upvotes: 1
Views: 908
Reputation: 47074
You can pipe the output to a loop in which you use head to chop the stream into files.
$ unrar p "$file" | ( i=0; while :; do i=$((i+1)); head -n 10000 | gzip > split.$i.gz; done )
The only thing you still have to work out is how to terminate the loop, since as written it will go on generating empty files. That is left as an exercise for the reader.
Compressing empty input still produces some output (for gzip, around 26 bytes), so you can test for that:
$ unrar p "$file" |
  ( i=0
    while :; do
      i=$((i+1))
      head -n 10000 | gzip > split.$i.gz
      # an (almost) empty compressed chunk means the input is exhausted
      if [ "$(stat -c %s split.$i.gz)" -lt 30 ]; then
        rm split.$i.gz
        break
      fi
    done )
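Alternatively (my sketch, not part of the answer above): let read probe for the first line of each chunk, so the loop ends cleanly at end of input and no empty file is ever created:

$ unrar p "$file" |
  ( i=0
    # read fetches one line per chunk and fails at EOF, terminating the loop
    while IFS= read -r first; do
      i=$((i+1))
      { printf '%s\n' "$first"; head -n 9999; } | gzip > split.$i.gz
    done )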
Upvotes: 2
Reputation: 80031
If you don't mind wrapping the file in a tar archive, then you can use tar
to do the splitting for you (tar cannot compress a multi-volume archive itself, so you compress each volume afterwards).
You can use tar -M --tape-length 1024
to create 1 megabyte volumes (the length is given in units of 1024 bytes). Do note that after every volume tar will ask you to press enter before it starts writing to the next file. So you will have to wrap it with your own script and move (and compress) the resulting file before doing so.
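A rough sketch of such a wrapper (assuming GNU tar, whose -F/--info-script option runs a script at each volume change instead of prompting; tar exports TAR_ARCHIVE, TAR_VOLUME and TAR_FD to that script — the script name and archive naming here are hypothetical):

#!/bin/bash
# next-volume.sh - run by GNU tar (via -F) each time a volume fills up.
# TAR_ARCHIVE is the volume just written, TAR_VOLUME the number of the next
# one, and TAR_FD the descriptor on which tar expects the next volume name.
pbzip2 -9 "$TAR_ARCHIVE"                    # compress (and remove) the finished volume
echo "archive.tar.$TAR_VOLUME" >&"$TAR_FD"  # hand tar the name of the next volume

Invoked like:

$ tar -c -M --tape-length=102400 -F ./next-volume.sh -f archive.tar.1 hugefile.log

Note that the script is not run after the final volume, so the last volume has to be compressed by hand.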
Upvotes: -1