Reputation: 13666
I have a directory which contains thousands of .gz files. Now I want to find the largest uncompressed file size without unzipping them. E.g., dir1 has 1.gz, 2.gz, 3.gz and so on, and I want to find the largest uncompressed file size without uncompressing the files.
I tried the following command, but it is not working:
find . -type f -name '*.gz' | xargs zcat | xargs ls -1s
I am new to Bash and Linux.
Upvotes: 2
Views: 811
Reputation: 3791
Interestingly, according to http://www.gzip.org/zlib/rfc-gzip.html:
ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input data modulo 2^32.
So the format contains the original size (modulo 2^32, which "ought to be enough for anybody", but of course is not... see the warnings below!)... Now we just need a command to output it for us: gzip -l file(s) ; the uncompressed size is the 2nd column of its output.
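As an aside, since ISIZE is simply the last 4 bytes of the file (stored little-endian), it can even be read without invoking gzip; a minimal sketch, assuming a single-member .gz file and a little-endian machine (the isize helper name is just for illustration):
isize() { tail -c 4 "$1" | od -An -t u4 | tr -d ' ' ; }   # read the trailing ISIZE field
isize 1.gz   # prints the stored (modulo 2^32) uncompressed size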
Therefore you DO NOT NEED to uncompress the files at all IF your original files were all less than 4 GB in size:
find . -name '*.gz' -print | xargs gzip -l | awk '{ print $2, $4 ;}' | grep -v '(totals)$' | sort -n | tail -1
This will be a great deal faster than the other solutions I see here ^^
BUT please be warned: for files originally 2^32 bytes (4 GB) or larger, the stored value is only "modulo 2^32" (so, for example, a file of 2^32 + 1 bytes will be reported as having a size of 1 byte!). So if you have compressed files that were originally larger than 4 GB, you need to uncompress them (on the fly if you want) to get their real size!
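The on-the-fly count mentioned above is just a decompress-and-count pipe: exact for any size, but it has to read the whole file. A minimal sketch (big_file.gz is a placeholder name):
zcat big_file.gz | wc -c   # exact uncompressed byte count, no temporary file needed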
Edit: I tried to see whether the ratio could be used instead of the "original size modulo 2^32": no...
$ dd if=/dev/zero of=1_gb bs=1048576 count=1024   # create a 1 GB file
$ dd if=/dev/zero of=5_gb bs=1048576 count=5120   # create a 5 GB file
$ gzip 1_gb 5_gb                                  # compress both
$ ls -al *gb*
-rw-r--r-- 1 user UsersGrp 1042074 Mar 4 10:30 1_gb.gz
-rw-r--r-- 1 user UsersGrp 5210215 Mar 4 10:28 5_gb.gz
$ gzip -l *gb*
         compressed        uncompressed  ratio uncompressed_name
            1042074          1073741824  99.9% 1_gb
            5210215          1073741824  99.5% 5_gb
            6252289          2147483648  99.7% (totals)
(Notice the 2nd line: the uncompressed size is reported as 1 GB, not 5 GB, because it is stored modulo 2^32 (= 4 GB) :( )
=> the ratio is unusable too for files > 4 GB... (5 GB / 5210215 ≈ 1030, and 1 GB / 1042074 ≈ 1030 too, so the real compression ratios are essentially the same; yet gzip reports 99.5% vs 99.9%, because the displayed ratio is computed from the truncated "uncompressed" field, not from the original size itself.)
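A possible compromise would be to trust ISIZE when it looks plausible and fall back to counting otherwise. This is only a sketch, assuming bash: the "reported size smaller than the compressed size" test is just a heuristic for a wrapped ISIZE, and it misses files whose wrapped value still exceeds the compressed size:
max=0 ; largest=
for f in *.gz ; do
    read -r comp uncomp < <(gzip -l "$f" | awk 'NR==2 {print $1, $2}')
    if (( uncomp < comp )) ; then      # suspicious: ISIZE probably wrapped mod 2^32
        uncomp=$(zcat "$f" | wc -c)    # slow but exact
    fi
    if (( uncomp > max )) ; then
        max=$uncomp
        largest=$f
    fi
done
echo "$largest ($max bytes uncompressed)"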
Upvotes: 5
Reputation: 4966
Found nearly the same solution as Olivier Dulac, at the same time, using gzip -l:
find . -name '*.gz' | xargs gzip -l | tail -n +2 | head -n -1 | sort -k2 -n | tail -n 1 | awk '{print $NF}'
Upvotes: 0
Reputation: 2543
If you prefer a one-liner (over ruakh's solution), you can try this:
find . -type f -name '*.gz' -printf "%p " -exec sh -c 'zcat "$1" | wc -c' sh {} \; | sort -k2 -n | tail -1
Explanation: -printf "%p " prints each file's path followed by a space (and no newline), the -exec sh -c 'zcat "$1" | wc -c' sh {} \; part appends the uncompressed byte count on the same line, sort -k2 -n then orders the lines numerically by that count, and tail -1 keeps the largest one.
Upvotes: 0
Reputation: 12715
You can try:
find . -type f -name '*.gz' -printf '%s %p\n'|sort -nr|head -n 1
This will sort the *.gz files in descending order of their compressed (on-disk) file sizes and then print the first file in that list.
Upvotes: 1
Reputation: 183300
Your command does not really make sense; find . -type f -name '*.gz' | xargs zcat will (if all goes well) write out the contents of all the gzipped files, but it doesn't make sense to convert that output into command-line arguments (as xargs does) and pass them to ls -1s (which expects its arguments to be filenames).
I do not see a good way to salvage your approach. Instead, I recommend writing a loop:
max_size=0
for file in *.gz ; do
size="$(zcat "$file" | wc -c)"
if (( size > max_size )) ; then
max_size="$size"
largest_file="$file"
fi
done
echo "$largest_file"
Upvotes: 2