user7610

Reputation: 28751

How do content addressable storage systems deal with possible hash collisions?

Content addressable storage systems use the hash of the stored data as the identifier and the address. Collisions are incredibly rare, but if the system is used a lot for a long time, it might happen. What happens if there are two pieces of data that produce the same hash? Is it inevitable that the most recently stored one wins and data is lost, or is it possible to devise ways to store both and allow accessing both?

To keep the question narrow, I'd like to focus on Camlistore. What happens if permanodes collide?

Upvotes: 5

Views: 1507

Answers (4)

mako

Reputation: 1303

(Disclaimer: Possibly not related to camlistore)

Hash collisions shouldn't happen, but as far as I know many hash functions aren't formally proved to be collision-resistant (and there's quantum computing to worry about, though some hashes are considered quantum-proof), so we can't completely rule them out. I think this is something a protocol should be ready to deal with.

How I'd handle it:

So, first, we agree on an ordered list of hashing schemes, at least one of which we hope is fully unbreakable. Everyone standardises on using the first unbroken scheme on the list.

As soon as anyone detects a collision, here's what happens (a rough sketch in code follows the list):

  • They send every peer they know the specimens that caused it as a proof that It's Happening, and as soon as anyone receives such a proof, they should also tell everyone they know. And they should all switch to the next unbroken hash function and start porting their data over to it.

  • Apps that use the distributed storage system freeze up and show a message to the user explaining that It's Happening.

  • Data is rehashed with the next hash function. This requires a format with self-describing structure (ack, this is the first time I've noticed a strict need for that. IPLD has that though.)

  • Apps resume working.

  • Ideally, if we can afford to, we should all try to assemble a mapping over all known data, from old hash to new hash (a snapshot of the last known meanings of the dead hash addresses), for historical purposes, or for manually converting data that couldn't be converted automatically. Since whoever is breaking the hash probably has limited compute, and since we're taking these snapshots as quickly as we can, we can assume that most of these hashes were never aliased and develop a consensus about what they used to mean. Since we're never going to use that hash function again, the snapshot can be locked after a month or so.
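
Here is a minimal sketch of that switch-over, assuming a toy in-memory store; the names (hashSchemes, Store, OnCollisionProof) are all hypothetical and this is not Camlistore or IPLD code, just an illustration of "advance to the next scheme and re-address everything":

    // Hedged sketch: an ordered list of hash schemes plus a toy in-memory store
    // that re-addresses its blobs under the next scheme once a collision in the
    // current one is proven. All names here are made up for illustration.
    package main

    import (
        "crypto/sha256"
        "crypto/sha512"
        "encoding/hex"
        "fmt"
        "hash"
    )

    type scheme struct {
        name string
        new  func() hash.Hash
    }

    // Ordered list: everyone uses the first scheme until a collision retires it.
    var hashSchemes = []scheme{
        {"sha256", sha256.New},
        {"sha512", sha512.New},
    }

    type Store struct {
        current int               // index of the scheme currently in use
        blobs   map[string][]byte // address -> data
        aliases map[string]string // old address -> new address (historical snapshot)
    }

    func (s *Store) addr(data []byte) string {
        h := hashSchemes[s.current].new()
        h.Write(data)
        return hashSchemes[s.current].name + "-" + hex.EncodeToString(h.Sum(nil))
    }

    func (s *Store) Put(data []byte) string {
        a := s.addr(data)
        s.blobs[a] = append([]byte(nil), data...)
        return a
    }

    // OnCollisionProof: two distinct specimens hashed to the same address, so
    // switch to the next scheme and re-address every blob, keeping an
    // old-to-new mapping for historical lookups.
    func (s *Store) OnCollisionProof() {
        s.current++
        rehashed := make(map[string][]byte, len(s.blobs))
        for oldAddr, data := range s.blobs {
            newAddr := s.addr(data)
            rehashed[newAddr] = data
            s.aliases[oldAddr] = newAddr
        }
        s.blobs = rehashed
    }

    func main() {
        s := &Store{blobs: map[string][]byte{}, aliases: map[string]string{}}
        old := s.Put([]byte("hello"))
        s.OnCollisionProof()
        fmt.Println(old, "->", s.aliases[old])
    }

The old-to-new alias map is exactly the "snapshot of the last known meanings of the dead hash addresses" described above; the data itself only needs to be re-addressed if its format is self-describing enough to walk.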

Upvotes: 0

Jim Grisham

Reputation: 442

In an ideal collision-resistant system, when a new file / object is ingested:

  1. A hash is computed of the incoming item.
  2. If the incoming hash does not already exist in the store:
    1. the item data is saved and associated with the hash as its identifier
  3. If the incoming hash does match an existing hash in the store:
    1. The existing data is retrieved
    2. A bit-by-bit comparison of the existing data is performed with the new data
    3. If the two copies are found to be identical, the new entry is linked to the existing hash
    4. If the copies are not identical, the new data is either
      1. Rejected, or
      2. Appended or prefixed* with additional data (e.g. a timestamp or userid) and re-hashed; this entire process is then repeated.

So no, it's not inevitable that information is lost in a content-addressable storage system; a rough sketch of this ingest flow in code follows below.

* Ideally, the existing stored data would then be re-hashed in the same way, and the original hash entry tagged somehow (e.g. linked to a zero-byte payload) to notate that there were multiple stored objects that originally resolved to that hash (similar in concept to a 'Disambiguation page' on Wikipedia). Whether that is necessary depends on how data needs to be retrieved from the system.
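
A minimal sketch of that ingest flow, assuming a simple in-memory store (the Store type and ErrCollision value are illustrative, not any particular product's API). On a hash match, the stored bytes are compared bit for bit before the new data is linked or rejected:

    // Hedged sketch of the ingest flow described above, using an in-memory store.
    package main

    import (
        "bytes"
        "crypto/sha256"
        "encoding/hex"
        "errors"
        "fmt"
    )

    type Store struct {
        blobs map[string][]byte // hash -> data
        refs  map[string]int    // hash -> number of ingested items linked to it
    }

    var ErrCollision = errors.New("hash collision: same hash, different bytes")

    func (s *Store) Ingest(data []byte) (string, error) {
        sum := sha256.Sum256(data) // 1. hash the incoming item
        key := hex.EncodeToString(sum[:])

        existing, ok := s.blobs[key]
        if !ok { // 2. new hash: store the data under it
            s.blobs[key] = append([]byte(nil), data...)
            s.refs[key] = 1
            return key, nil
        }
        if bytes.Equal(existing, data) { // 3. same hash: bit-by-bit comparison
            s.refs[key]++ // identical, so link the new entry to the existing hash
            return key, nil
        }
        // 4. same hash, different bytes: reject (a real system might instead
        //    re-hash with extra data appended, as described above).
        return "", ErrCollision
    }

    func main() {
        s := &Store{blobs: map[string][]byte{}, refs: map[string]int{}}
        k1, _ := s.Ingest([]byte("report.pdf contents"))
        k2, _ := s.Ingest([]byte("report.pdf contents")) // deduplicated, not stored twice
        fmt.Println(k1 == k2, s.refs[k1])                // true 2
    }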


While intentionally causing a collision may be astronomically impractical for a given algorithm, a random collision is possible as soon as the second storage transaction.


Note: Some small / non-critical systems skip the binary comparison step, trading risk for bandwidth or processing time. (Usually, this is only done if certain metadata matches, such as filename or data length.)

The risk profile of such a system (e.g. a single git repository) is far different than for an enterprise / cloud-scale environment that ingests large amounts of binary data, especially if that data appears random (e.g. encrypted / compressed files) and is combined with something like sliding-window deduplication.



Upvotes: 1

eltronix

Reputation: 61

Use a composite key, e.g. hash + userId.
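
A minimal sketch of that idea (the function and field names are illustrative): the address is the content hash plus a per-user qualifier, so two different blobs that ever produced the same hash would still get distinct keys.

    // Hedged sketch of a composite key: content hash plus user id.
    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    func compositeKey(data []byte, userID string) string {
        sum := sha256.Sum256(data)
        return hex.EncodeToString(sum[:]) + ":" + userID
    }

    func main() {
        fmt.Println(compositeKey([]byte("same bytes"), "alice"))
        fmt.Println(compositeKey([]byte("same bytes"), "bob")) // same hash, different key
    }

Note the trade-off: keying on the user id this way gives up deduplication across users, since identical content uploaded by two users gets two keys.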

Upvotes: 1

user7610

Reputation: 28751

It is assumed that collisions do not happen. Which is a perfectly reasonable assumption, given a strong hash function and casual, non-malicious user input. SHA-1, which is what Camlistore currently uses, is also resistant to malicious attempts to produce collisions.

In case a hash function becomes weak with time and needs to be retired, Camlistore supports a migration to a new hash function for new blobrefs, while keeping old blobrefs accessible.

If a collision did happen, as far as I understand, the first stored blobref with that hash would win.
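
Camlistore-style blobrefs name the hash function in the ref itself (e.g. sha1-<hexdigest>), which is what lets old and new refs coexist after such a migration. A rough sketch of verifying either kind, assuming that format and not Camlistore's actual implementation:

    // Hedged sketch: verify a blob against a "<hashname>-<hexdigest>" blobref,
    // so refs made with a retired hash remain checkable after a migration.
    package main

    import (
        "crypto/sha1"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "strings"
    )

    func verify(blobref string, data []byte) bool {
        name, digest, ok := strings.Cut(blobref, "-")
        if !ok {
            return false
        }
        switch name {
        case "sha1":
            sum := sha1.Sum(data)
            return digest == hex.EncodeToString(sum[:])
        case "sha256":
            sum := sha256.Sum256(data)
            return digest == hex.EncodeToString(sum[:])
        default:
            return false // unknown or retired hash function
        }
    }

    func main() {
        data := []byte("hello")
        sum := sha1.Sum(data)
        ref := "sha1-" + hex.EncodeToString(sum[:])
        fmt.Println(verify(ref, data)) // true: old refs stay verifiable alongside new ones
    }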

source: https://groups.google.com/forum/#!topic/camlistore/wUOnH61rkCE

Upvotes: 2
