Reputation: 28751
Content addressable storage systems use the hash of the stored data as the identifier and the address. Collisions are incredibly rare, but if the system is used a lot for a long time, it might happen. What happens if there are two pieces of data that produce the same hash? Is it inevitable that the most recently stored one wins and data is lost, or is it possible to devise ways to store both and allow accessing both?
To keep the question narrow, I'd like to focus on Camlistore. What happens if permanodes collide?
Upvotes: 5
Views: 1507
Reputation: 1303
(Disclaimer: Possibly not related to camlistore)
Hash collisions shouldn't happen, but afaik many hash functions aren't formally proved to be collision-resistant (also quantum computing etc, though some hashes are considered quantum-proof) so we can't completely rule it out, so yeah, afaik, I think this is something a protocol should be ready to deal with.
How I'd handle it:
So, first, we agree on an ordered list of hashing schemes, at least one of which we hope is fully unbreakable. We try to standardise on using the lowest scheme on the list.
As soon as anyone detects a collision, here's what happens:
They send every peer they know the specimens that caused it as a proof that It's Happening, and as soon as anyone receives such a proof, they should also tell everyone they know. And they should all switch to the next unbroken hash function and start porting their data over to it.
Apps that use the distributed storage system freeze up and show a message to the user explaining that It's Happening.
Data is rehashed with the next hash function. This requires a format with self-describing structure (ack, this is the first time I've noticed a strict need for that. IPLD has that though.)
Apps resume working.
Ideally, if we can afford to, we really should all try to assemble a mapping over all known data, from old hash to new hash — a snapshot of the last known meanings of the dead hash addresses — for historical purposes, or for manually converting data that couldn't be converted automatically. Since whoever is breaking the hash probably has limited compute and since we're taking these snapshots as quickly as we can, we can assume that most of these hashes were never aliased and develop a consensus about what they used to mean. Since we're never going to use that hash function again, the snapshot can be locked after a month or so.
Upvotes: 0
Reputation: 442
In an ideal collision-resistant system, when a new file / object is ingested:
So no, it's not inevitable that information is lost in a content-addressable storage system.
* Ideally, the existing stored data would then be re-hashed in the same way, and the original hash entry tagged somehow (e.g. linked to a zero-byte payload) to notate that there were multiple stored objects that originally resolved to that hash (similar in concept to a 'Disambiguation page' on Wikipedia). Whether that is necessary depends on how data needs to be retrieved from the system.
While intentionally causing a collision may be astronomically impractical for a given algorithm, a random collision is possible as soon as the second storage transaction.
Note: Some small / non-critical systems skip the binary comparison step, trading risk for bandwidth or processing time. (Usually, this is only done if certain metadata matches, such as filename or data length.)
The risk profile of such a system (e.g. a single git
repository) is far different than for an enterprise / cloud-scale environment that ingests large amounts of binary data, especially if that data is apparent random binary data (e.g. encrypted / compressed files) combined with something like sliding-window deduplication.
See also, e.g.:
Upvotes: 1
Reputation: 28751
It is assumed that collisions do not happen. Which is a perfectly reasonable assumption, given a strong hash function and a casual, non-malicious user inputs. SHA-1, which is what Camlistore currently uses, is also resistant to malicious attempts to produce collision.
In case a hash function becomes weak with time and needs to be retired, Camlistore supports a migration to a new hash function for new blobrefs, while keeping old blob refs accessible.
If a collision did happen, as far as I understand, the first stored blobref with that hash would win.
source: https://groups.google.com/forum/#!topic/camlistore/wUOnH61rkCE
Upvotes: 2