Reputation: 2421
I have a directory I’m archiving:
$ du -sh oldcode
1400848
$ tar cf oldcode.tar oldcode
So the directory is about 1.4 GB. The tar file is significantly smaller, though:
$ ls -l oldcode.tar
-rw-r--r-- 1 ieure ieure 940339200 2002-01-30 10:33 oldcode.tar
Only about 897 MB. It's not compressed in any way:
$ file oldcode.tar
oldcode.tar: POSIX tar archive
Why is the tar file smaller than its contents?
Upvotes: 24
Views: 5049
Reputation: 30647
There are two possibilities.
Most likely, it isn't smaller than its contents. As Nils Pipenbrinck wrote, du displays the amount of space the filesystem allocates. Because files are stored in whole filesystem blocks, that is more than the logical size of the files.
To view the logical size of the files, use du --apparent-size. In this case, the result should be smaller than the tar file.
Alternatively, tar files can store sparse files. If the tarball was created using --sparse, the holes in the sparse files are recorded rather than stored, so the tarball could be smaller than the logical size of the files.
If the sparseness information in your extracted copy was somehow lost (e.g. if you extracted the tarball onto a filesystem that doesn't support sparse files, or if it was zipped and then unzipped), then du will report the expanded size.
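To see which of the two cases applies, it can help to compute both numbers for the same tree. The following is a minimal Python sketch, not a replacement for du: `tree_sizes` and `demo` are hypothetical helper names, it only looks at non-directory entries, it does not deduplicate hard links, and it assumes a Unix-like system where `st_blocks` is available and counted in 512-byte units.

```python
import os
import tempfile


def tree_sizes(root):
    """Return (apparent, allocated) byte totals for non-directory entries under root.

    apparent  = sum of st_size, roughly what `du --apparent-size` looks at
    allocated = sum of st_blocks * 512, roughly what plain `du` looks at
    Directories themselves and hard-link deduplication are ignored here.
    """
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            # Python documents st_blocks as the number of 512-byte blocks
            # allocated, independent of the filesystem's own block size.
            allocated += st.st_blocks * 512
    return apparent, allocated


def demo():
    """Build a throwaway tree with one tiny file and one file that is all hole."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "tiny"), "w") as f:
            f.write("1\n")
        with open(os.path.join(d, "holey"), "wb") as f:
            f.truncate(1 << 20)  # 1 MiB logical size, usually stored sparsely
        apparent, allocated = tree_sizes(d)
        print("apparent :", apparent)
        print("allocated:", allocated)


if __name__ == "__main__":
    demo()
```

If allocated is larger than apparent, block rounding on many small files is the likely explanation. If allocated is much smaller than apparent, the tree probably contains sparse files. The exact numbers depend on the filesystem.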
Upvotes: 3
Reputation: 403
This has to do with the block size of your filesystem. man 1 du on Mac OS X 10.5.6 states:
The du utility displays the file system block usage for each file argument and for each directory in the file hierarchy rooted in each directory argument. If no file is specified, the block usage of the hierarchy rooted in the current directory is displayed.
[mirko@borg foo]$ ls -la
total 0
drwxr-xr-x 2 mirko wheel 68 Jan 30 21:20 .
drwxrwxrwt 10 root wheel 340 Jan 30 21:16 ..
[mirko@borg foo]$ du -sh
0B .
[mirko@borg foo]$ touch foo
[mirko@borg foo]$ ls -la
total 0
drwxr-xr-x 3 mirko wheel 102 Jan 30 21:20 .
drwxrwxrwt 10 root wheel 340 Jan 30 21:16 ..
-rw-r--r-- 1 mirko wheel 0 Jan 30 21:20 foo
[mirko@borg foo]$ du -sh
0B .
[mirko@borg foo]$ echo 1 > foo
[mirko@borg foo]$ ls -la
total 8
drwxr-xr-x 3 mirko wheel 102 Jan 30 21:20 .
drwxrwxrwt 10 root wheel 340 Jan 30 21:16 ..
-rw-r--r-- 1 mirko wheel 2 Jan 30 21:20 foo
[mirko@borg foo]$ du -sh
4.0K .
As you can see, even a 2-byte file takes up a whole 4 KB block. Some filesystems avoid this waste of space through block suballocation.
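The rounding in that session can be written down as simple arithmetic. This is a sketch under the assumption of a fixed 4096-byte block and no suballocation or inline storage; `allocated_bytes` and `slack` are illustrative names, and filesystem metadata is ignored.

```python
def allocated_bytes(size, block_size=4096):
    """Space taken by a file of `size` bytes when storage is handed out in whole blocks."""
    blocks = -(-size // block_size)  # ceiling division; 0 bytes needs 0 blocks
    return blocks * block_size


def slack(sizes, block_size=4096):
    """Total bytes allocated but not used by file data, for a list of file sizes."""
    return sum(allocated_bytes(s, block_size) - s for s in sizes)
```

Here `allocated_bytes(0)` is 0 and `allocated_bytes(2)` is 4096, which matches the 0B and 4.0K that du printed above. For a thousand 1 KB files, `slack([1024] * 1000)` comes to 3,072,000 bytes, three times the size of the data itself.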
Upvotes: 3
Reputation: 57076
Not knowing which tar or which Unix system you're using, here's my guess: oldcode contains numerous small files, which individually use disk space inefficiently because disk space is allocated in whole blocks rather than byte by byte. In the tar file they're concatenated, so far less of the space they occupy is lost to padding.
Upvotes: 4
Reputation: 86443
You get a difference because of the way the filesystem works.
In a nutshell, your disk is made up of clusters. Each cluster has a fixed size of, let's say, 4 kilobytes. If you store a 1 KB file in such a cluster, 3 KB will be unused. The exact details vary with the kind of filesystem you use, but most filesystems work that way.
3 KB of wasted space is not much for a single file, but if you have lots of very small files, the waste can become a significant part of the disk usage.
Inside the tar archive, the files are not stored in clusters but one after another. That's where the difference comes from.
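"One after another" still involves a little padding, just much less. In the ustar format each member gets a 512-byte header, its data is padded to a multiple of 512 bytes, and the archive ends with two zero blocks. GNU tar and Python's tarfile additionally pad the whole archive to a 10240-byte record by default. The sketch below estimates the archive size from file sizes alone and checks the estimate against an in-memory archive. `estimate_tar_size` and `actual_tar_size` are hypothetical helpers, and the estimate ignores directory entries (one header each) and the extra headers needed for very long names.

```python
import io
import tarfile

BLOCK = 512      # unit for tar headers and data padding
RECORD = 10240   # 20 blocks; default record size in GNU tar and Python's tarfile


def round_up(n, unit):
    """Round n up to the next multiple of unit."""
    return -(-n // unit) * unit


def estimate_tar_size(sizes):
    """Estimated size of an uncompressed ustar archive holding files of these sizes."""
    body = sum(BLOCK + round_up(s, BLOCK) for s in sizes)  # header + padded data
    body += 2 * BLOCK  # end-of-archive marker: two zero blocks
    return round_up(body, RECORD)


def actual_tar_size(sizes):
    """Build a ustar archive in memory with dummy files and return its length."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for i, size in enumerate(sizes):
            info = tarfile.TarInfo(name=f"file{i}")
            info.size = size
            tar.addfile(info, io.BytesIO(b"x" * size))
    return len(buf.getvalue())
```

With 512-byte padding instead of 4 KB clusters, a 1 KB file costs 1536 bytes in the archive rather than 4096 on disk. As a side note, the 940339200 bytes reported in the question is an exact multiple of 10240, which would be consistent with that record padding.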
Upvotes: 45