D'Arcy Rittich

Reputation: 171351

Do any common OS file systems use hashes to avoid storing the same content data more than once?

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons), e.g., Git (SHA-1) and Dropbox (SHA-256). The file names and dates can be different, but as long as the content hashes to the same value, it is never stored more than once.
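To make the idea concrete, here is a minimal sketch of content-addressed storage (hypothetical Python, not how Git or Dropbox actually implement it): the hash of the content is the storage key, so identical content collapses to a single stored object.

    import hashlib

    store = {}  # content hash -> content; stands in for the object store on disk

    def save(name, content, index):
        key = hashlib.sha256(content).hexdigest()
        store.setdefault(key, content)  # identical content is stored only once
        index[name] = key               # names and dates live in the index, not the store

    index = {}
    save("a.txt", b"hello", index)
    save("copy of a.txt", b"hello", index)
    assert len(store) == 1  # one stored object, two file names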

It seems this would be a sensible thing to do in an OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?

This would, for the most part, eliminate the need for duplicate-file-finder utilities, because at that point the only space you would be saving would be the file entry in the file system, which for most users is too little to matter.

Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.

Upvotes: 14

Views: 5987

Answers (6)

Jerry Coffin

Reputation: 490008

It would take a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy while the other remains intact -- so when you eliminate the duplication, the hidden hard link you created that way would have to give copy-on-write (COW) semantics.

Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, the permissions are applied correctly based on the link, not just on the location of the actual content.

Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to guard against data loss from hardware failure, then find out that there was really only one copy of the data, so when that hardware fails, all three "copies" disappear.

This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least matter). With hard drives currently running less than US$100 per terabyte, I find it hard to believe that this would save most people a whole dollar's worth of hard drive space. At that point, it's hard to imagine most people caring much.

There are file systems that do deduplication, which is sort of like this, but still noticeably different. In particular, deduplication is typically done on the basis of relatively small blocks of a file, not on complete files. Under such a system, a "file" basically just becomes a collection of pointers to de-duplicated blocks. Along with the data, each block typically has some metadata of its own, separate from the metadata for the file(s) that refer to that block (e.g., it will usually include at least a reference count). Any block with a reference count greater than 1 is treated as copy-on-write. That is, any attempt to write to that block will typically create a copy, write to the copy, then store the copied block back to the pool (so if the result comes out the same as some other block, deduplication will coalesce it with the existing block of the same content).
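As a rough illustration of that scheme, here is a minimal sketch (in Python, with a made-up BlockPool/File API rather than any real file system's internals): blocks are stored once per content hash, reference-counted, and copied on write.

    import hashlib

    class BlockPool:
        """Stores each unique block once, keyed by its content hash."""
        def __init__(self):
            self.blocks = {}    # hash -> block data
            self.refcount = {}  # hash -> number of references from files

        def put(self, data):
            key = hashlib.sha256(data).hexdigest()
            if key not in self.blocks:   # deduplicate: only new content is stored
                self.blocks[key] = data
                self.refcount[key] = 0
            self.refcount[key] += 1
            return key

        def release(self, key):
            self.refcount[key] -= 1
            if self.refcount[key] == 0:  # last reference gone, reclaim the block
                del self.blocks[key]
                del self.refcount[key]

    class File:
        """A 'file' is just an ordered list of pointers (hashes) into the pool."""
        def __init__(self, pool):
            self.pool = pool
            self.block_keys = []

        def append_block(self, data):
            self.block_keys.append(self.pool.put(data))

        def write_block(self, index, data):
            # Copy-on-write: never modify a shared block in place. Hash the new
            # contents, add them to the pool (coalescing if they match an
            # existing block), and drop the reference to the old block.
            old_key = self.block_keys[index]
            self.block_keys[index] = self.pool.put(data)
            self.pool.release(old_key)

    pool = BlockPool()
    a, b = File(pool), File(pool)
    a.append_block(b"same data")
    b.append_block(b"same data")
    assert len(pool.blocks) == 1      # identical blocks stored once
    a.write_block(0, b"new data")     # COW: b still sees b"same data"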

Many of the same considerations still apply though--most people don't have enough duplication to start with for deduplication to help a lot.

At the same time, especially on servers, deduplication at a block level can serve a real purpose. One really common case is dealing with multiple VM images, each running one of only a few choices of operating systems. If we look at the VM image as a whole, each is usually unique, so file-level deduplication would do no good. But they still frequently have a large chunk of data devoted to storing the operating system for that VM, and it's pretty common to have many VMs running only a few operating systems. With block-level deduplication, we can eliminate most of that redundancy. For a cloud server system like AWS or Azure, this can produce really serious savings.

Upvotes: 1

Tom Hale

Reputation: 46715

btrfs supports block-level de-duplication of data while the file system is online, but an external tool is needed to trigger it; I'd recommend duperemove.

Upvotes: 2

FRotthowe

Reputation: 3662

ZFS has supported deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup

Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.

Upvotes: 10

Sudhanshu

Reputation: 2871

NetApp has supported deduplication (that's what it's called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. This is one of the most important features found in enterprise filesystems today (and NetApp stands out because it supports this on primary storage as well, whereas other similar products support it only on their backup or secondary storage -- they are too slow for primary storage).

The amount of duplicate data in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% of data being deduplicated are common, saving lots of space and tons of money for large enterprises.

All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, you get deduplication for free, no matter what filesystem is created in that LUN. Cheers. Find out how it works here and here.

Upvotes: 5

blowdart

Reputation: 56490

NTFS has single instance storage.

Upvotes: 5

Matt

Reputation: 10554

It would save space, but the time cost is prohibitive. The products you mention are already I/O bound, so the computational cost of hashing is not a bottleneck for them. If you hashed at the file-system level, all I/O operations, which are already slow, would get even slower.
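To put a rough number on that hashing cost, here is a small timing sketch (plain Python and hashlib, so only a loose stand-in for what an in-kernel implementation would cost):

    import hashlib
    import os
    import time

    block = os.urandom(4 * 1024 * 1024)       # one 4 MB block of random data
    start = time.perf_counter()
    for _ in range(100):
        hashlib.sha256(block).hexdigest()     # hash 400 MB in total
    elapsed = time.perf_counter() - start
    print(f"~{100 * len(block) / elapsed / 1e6:.0f} MB/s SHA-256 throughput")

Whether that rate keeps up with the disk determines how much of a write-path penalty the hashing would add.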

Upvotes: 5
