Reputation: 892
We are working on a system (on Linux) that has very limited transmission resources. The maximum size that can be sent as one file is fixed, and we would like to send the minimum number of files. Because of this, all files to be sent are packed with tar and compressed with gzip (.tar.gz).
There are a lot of small files of different types (binary, text, images...) that should be packed in the most efficient way, so that the maximum amount of data is sent each time.
The problem is: is there a way to estimate the size of the .tar.gz file without actually running the tar utility, so that the best combination of files can be calculated?
Upvotes: 11
Views: 19707
Reputation: 141
Yes, you can measure the exact archive size without writing it to disk:
tar -czf - /directory/to/archive/ | wc -c
Meaning: this writes the archive to standard output and pipes it into wc, a tool that counts bytes. The output is the size of the archive in bytes. Strictly speaking, tar still runs; it just never saves the archive to disk.
Source: The Ultimate Tar Command Tutorial with 10 Practical Examples
Upvotes: 14
Reputation: 112284
It depends on what you mean by "small files", but generally, no. If you have a large file whose contents are relatively homogeneous, then you could compress 100K or 200K from the middle of it and use that compression ratio as an estimate for the rest of the file.
For files of around 32K or less, you need to compress them to see how big they will be. Note also that when you concatenate many small files in a tar file, you will get better compression overall than you would by compressing each small file individually.
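The middle-sample idea above can be sketched in Python with the standard zlib module. This is only an estimator under the homogeneity assumption stated above; the 128K sample size and the use of a raw zlib stream (a few bytes smaller than a real gzip stream) are arbitrary choices for illustration:

```python
import os
import zlib

def estimate_gzip_size(path, sample=128 * 1024):
    """Estimate the compressed size of a file by compressing a sample
    taken from its middle (assumes roughly homogeneous contents)."""
    size = os.path.getsize(path)
    if size <= sample:
        # Small file: just compress the whole thing.
        with open(path, 'rb') as f:
            return len(zlib.compress(f.read(), 9))
    with open(path, 'rb') as f:
        f.seek((size - sample) // 2)   # jump to the middle of the file
        data = f.read(sample)
    ratio = len(zlib.compress(data, 9)) / sample
    return int(size * ratio)           # extrapolate to the whole file
```

The estimate is only as good as the homogeneity assumption: a file with a compressible header and incompressible payload will fool it.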
I would recommend a simple greedy approach: take the largest file whose size, plus some overhead, is less than the space remaining under the maximum file size. The overhead is chosen to cover the tar header and the worst-case expansion from compression (a fraction of a percent). Add that file to the archive and repeat.
You can flush the compression after each file to see exactly how big the result is so far.
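The greedy-plus-flush idea can be sketched in Python with zlib directly (wbits=31 gives gzip framing). This is a sketch, not a real tar writer: the 512-byte zero block stands in for an actual tar header, `pack_greedy` is a name invented here, and you should reserve a little headroom in `max_size` for the final closing flush:

```python
import os
import zlib

TAR_HEADER = 512   # one 512-byte header block per member (simplification)

def pack_greedy(paths, max_size):
    """Greedily pick files largest-first, tracking the exact compressed
    size so far by full-flushing the deflate stream after each file."""
    comp = zlib.compressobj(9, zlib.DEFLATED, 31)   # wbits=31 -> gzip wrapper
    written, chosen = 0, []
    for path in sorted(paths, key=os.path.getsize, reverse=True):
        with open(path, 'rb') as f:
            data = f.read()
        trial = comp.copy()                          # snapshot: we may reject
        out = trial.compress(b'\0' * TAR_HEADER + data)
        out += trial.flush(zlib.Z_FULL_FLUSH)        # flush -> exact size so far
        if written + len(out) <= max_size:
            comp, written = trial, written + len(out)
            chosen.append(path)                      # it fits: commit the snapshot
    tail = comp.flush()                              # finish the gzip stream
    return chosen, written + len(tail)
```

`compressobj.copy()` is what makes rejection cheap: a file that does not fit is simply never committed to the real stream. Note that each Z_FULL_FLUSH costs a few bytes and resets the compressor's history, so the total is slightly larger than a single uninterrupted compression would be.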
Upvotes: 3