Reputation: 979
I want to incrementally back up all my MS Office, PDF, HTML, and XML files to a shared network location. I will be reading the files in 5 MB chunks and computing an MD5 hash of each chunk to get the de-dupe factor. My question: say a particular file gets modified after uploading and I now want to incrementally back up the changed data. If I keep the same chunk size, all the chunks will appear to be different and I will need to upload them all again. So is there a better approach to de-duplication, or would it be better to learn the internal structures (raw reading) of all the specified file types and work on de-dupe from there?
Upvotes: 3
Views: 1810
Reputation: 5084
There are a number of approaches to de-duplication. Having to know the internal structure of a file is probably the least attractive option - at least to me. We did something similar to what you're asking for and built a product around it.
A couple of observations; first, and you've probably heard enough of this, given the age of the question: MD5 isn't your best friend. The probability of a collision is too high for these types of applications. We chose SHA-1, as have a lot of other products that do similar work.
You recognized the problem with simple "chunking" of the data...that all subsequent chunks might have to be rewritten in the case of an early-in-file insert.
First, you will probably recognize that below a certain size threshold, this matters very little. The IO for changed smaller files is something you will probably just absorb.
But with larger files, it would be nice not to have to rewrite all the data if only a small portion changes...and with many large writable files, small changes in a large, otherwise static data set are exactly what happens. Files that have an internal database-ish structure, for example.
The trick, if it can be considered a trick, is to recognize ranges of data that are static. What this amounts to is calculating hashes over the data and watching for hash values you recognize.
For example, imagine calculating a rolling hash, byte by byte, as you go through the file. If your hashing function distributes well, each new byte will result in a hash value that's pretty randomly distant from the previous byte's contribution.
Recognizing a hash simply means that the hash value falls in a certain subset of values that you choose arbitrarily...values you've decided represent sentinel hashes. For example, you might recognize all the hashes that divide evenly by a constant value.
When you recognize a hash, you capture the offset of that byte in the file and reset your rolling hash to its initial state. At the end of the file, you'll have accumulated a list of offsets.
Now, the relative distance between those offsets is going to be controlled by how selective you've been with your hash recognizer. If you choose, for example, to recognize hash % 32 == 0, you're going to have a lot of offsets at small relative distances from each other. If you recognize hash % 65536 == 0, you're going to have a lot fewer, more widely spaced offsets. The distance between each offset will be variable...some will be quite small and some will be very large. Note: large chunks are very compressible.
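To make the scan concrete, here is a rough Python sketch of the boundary-finding pass. The window width, hash, and divisor below are illustrative choices of mine, not values from any particular product, and the window hash is recomputed from scratch at each step purely for readability (the rolling-hash trick that makes this cheap is covered further down):

    WINDOW = 48        # hypothetical sliding-window width n, in bytes
    DIVISOR = 65536    # selectivity: hash % DIVISOR == 0 marks a sentinel

    def window_hash(window: bytes) -> int:
        # Simple polynomial hash of the window contents, recomputed from
        # scratch here for clarity only; a real implementation updates it
        # incrementally as the window slides (see the Rabin-Karp sketch below).
        h = 0
        for b in window:
            h = (h * 257 + b) % ((1 << 61) - 1)
        return h

    def find_cut_offsets(data: bytes) -> list[int]:
        # Scan byte by byte and record an offset whenever the hash of the
        # last n bytes falls into the recognized ("sentinel") subset.
        offsets = []
        last_cut = 0
        for i in range(WINDOW, len(data) + 1):
            if i - last_cut < WINDOW:
                continue   # resetting at a cut enforces a minimum chunk of n bytes
            if window_hash(data[i - WINDOW:i]) % DIVISOR == 0:
                offsets.append(i)
                last_cut = i
        return offsets

The resulting offsets depend only on the local content of the data, not on absolute positions in the file, which is what lets boundaries survive inserts and deletes.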
These offsets are going to be the break points...you'll store chunks from offset to offset. Before you store a chunk, you'll calculate the hash of that chunk (a SHA-1 hash, not the running hash). If you've already got that chunk in your storage, you won't need to store it again. In your storage, files become lists of chunks. Chunks might initially come from one file, but get recognized as occurring in other files, too. De-duplication!
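Here's a sketch of that storage side, with a plain dict standing in for whatever chunk store you actually use; the names (chunk_store, recipe) are mine, not anything from a real product:

    import hashlib

    def store_file(data: bytes, offsets: list[int], chunk_store: dict) -> list[str]:
        # Split the data at the recognized offsets, store each chunk under its
        # SHA-1 digest if it isn't already present, and return the file's
        # "recipe": the ordered list of chunk digests.
        boundaries = [0] + [o for o in offsets if 0 < o < len(data)] + [len(data)]
        recipe = []
        for start, end in zip(boundaries, boundaries[1:]):
            chunk = data[start:end]
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in chunk_store:      # de-duplication happens here
                chunk_store[digest] = chunk    # only unseen chunks are stored/uploaded
            recipe.append(digest)
        return recipe

Restoring a file is just a matter of concatenating the chunks named in its recipe, and two files that share content end up sharing chunk entries.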
The selectivity you apply to the running hash controls not only the chunk size, but also, how well you capture "small changes" in large files.
It's important, at this point, to distinguish between a running hash and a rolling hash. It's really important that what you're calculating, as you roll along through the file, is a hash over the last n bytes, where n is the width of the sliding frame. We're not calculating a hash from offset to offset. We're trying to find n-byte sentinels that we recognize.
The size of n is important, too. You'll be calculating a hash for bytes 0 through n-1, then 1 through n, then 2 through n+1, etc. If you think about it, n represents the smallest chunk size that will exist (except for an end-of-file chunk that comes right after a sentinel).
So, at this point, you'd have to be thinking, "Whoa, that's a lot of hashing!", and you're right; but it's not as bad as you might think. Choosing the correct rolling hash algorithm is super important, and there's an algorithm extremely well suited for this. The one we use is the Rabin-Karp rolling hash, which uses a Rabin fingerprint to discover sentinel offsets, and its beauty is that adding a byte's contribution and removing a byte's contribution is trivial, inexpensive arithmetic.
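For illustration, here's a minimal sliding-window hash in that spirit. Note this is a simplified polynomial rolling hash (Rabin-Karp style) rather than a true Rabin fingerprint over GF(2) polynomials, and the base and modulus are arbitrary:

    class RollingHash:
        # Polynomial rolling hash over a fixed-width byte window. Sliding the
        # window one byte costs two multiplies and two adds: subtract the
        # outgoing byte's contribution, then fold in the incoming byte.
        BASE = 257
        MOD = (1 << 61) - 1

        def __init__(self, window: bytes):
            self.width = len(window)
            self.high = pow(self.BASE, self.width - 1, self.MOD)  # weight of the oldest byte
            self.value = 0
            for b in window:
                self.value = (self.value * self.BASE + b) % self.MOD

        def slide(self, outgoing: int, incoming: int) -> int:
            # Remove the byte leaving the window, add the byte entering it.
            self.value = (self.value - outgoing * self.high) % self.MOD
            self.value = (self.value * self.BASE + incoming) % self.MOD
            return self.value

Driving it is just h = RollingHash(data[:n]) followed by h.slide(data[i - n], data[i]) for each subsequent byte, checking h.value against your sentinel condition after every step.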
The reason the rolling hash is important (as opposed to a running hash) is change detection. Assume a recognized sentinel offset occurs before a change...and then another recognized sentinel occurs after the change. Only the chunk between these two offsets will be stored. The part before the change and the part after the change will have been previously stored.
Upvotes: 2
Reputation: 41
You can look at rsync and its algorithm.
Otherwise you may have to do something like what Data Domain does: checksumming variable chunk sizes so that data segments can be identified independently of their offset in a given file. Please search the net to look into their patents, etc. Giving a full write-up here is not possible.
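For reference, rsync's weak rolling checksum works in the same spirit; roughly (a simplified sketch based on the rsync technical report, with the strong-checksum pairing and exact details omitted):

    def weak_checksum(block: bytes, mod: int = 1 << 16) -> tuple[int, int]:
        # rsync-style weak checksum: s1 is the byte sum, s2 the sum of the
        # running sums. Both are cheap to slide along the file.
        s1 = s2 = 0
        for b in block:
            s1 = (s1 + b) % mod
            s2 = (s2 + s1) % mod
        return s1, s2

    def roll(s1: int, s2: int, outgoing: int, incoming: int,
             block_len: int, mod: int = 1 << 16) -> tuple[int, int]:
        # Slide the checksum window one byte to the right in O(1).
        s1 = (s1 - outgoing + incoming) % mod
        s2 = (s2 - block_len * outgoing + s1) % mod
        return s1, s2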
Upvotes: 0