Leonardo
Leonardo

Reputation: 11401

Complementary Hash function

Is there a type of Hashing that satisfy the following equation:

Hash(Hash(X)+Y) = Hash(X+Y)

Context:
I'm working with a append-only database that has to be synced across realms.
To guarantee that the sync occurred as expected we hash both databases and compare.
Since the databases are kinda huge the hash function that we use takes a valuable amount of time to compute. So I was wondering: if i already have the hash of a given data X, and the new data Y, if i could hash only Y and "merge" the hashes I could save alot of time...

Upvotes: 1

Views: 265

Answers (4)

mcdowella
mcdowella

Reputation: 19621

http://en.wikipedia.org/wiki/Merkle_tree have been used for this sort of problem (see the lower sections of that URL). Treat your data as the leaves of a tree, and then compute a hash function at the top of the tree from the bottom up, where the hash function computed at a node is hash(A || B), where A and B are the hash functions computed at its children.

Another option would be to produce hashes of the entire database only at intervals, and to distribute hashes of the concatenated data added to it since the last full hash. This is pretty much just a degenerate version of computing and distributing the merkle tree hash and some of the newer values at the right border of the tree as it grows.

Upvotes: 1

rici
rici

Reputation: 241901

One way to solve the actual problem in the (edited) post, as well as something similar to the literal question, is to hash the data in chunks of some convenient size, where convenience is based both on the size of the database and the expected size of an update. In effect, the hash of the data is the concatenation of the hashes of the chunks, and that obeys the equality:

 HASH(x:Y) = HASH(X):HASH(Y)

where : is the concatenation operator.

It's not necessary for the chunks to be of equal sizes if you store the chunk size with the chunk hash. Of course, in that case the hash function is no longer deterministic, and for comparison purposes you need the sequence of chunk sizes in order to compute the updated hash.

For a deterministic hash, you can use a fixed chunk size with a single (shorter) chunk at the end; the complete hash is assembled by prepending the size of the last chunk to the sequence of hashes. In order to compute the updated hash, it is necessary to start hashing at the beginning of the truncated block, which involves a little duplication of effort, but relatively speaking it won't be much.

For a database measured in terabytes, a reasonable chunk size might be 1GB; if the hash is 128 bits, the total size of the hash will be 16kb per terabyte of database, which is relatively trivial. If terabytes exceed your expectation for "kinda huge", adjust the chunk size appropriately :)

Another advantage of this technique is that the chunk hashes can be computed in parallel. If database updates are cached in RAM, parallel hashing can be a big win.

Upvotes: 0

Maciej S
Maciej S

Reputation: 586

If "+" is concatenation then MD5, SHA1, SHA256 (and more) will almost meet this equation. Output of those hash functions is their internal state so you can calculate Hash(X + Y) knowing only Hash(X) and Y. This property of this hash functions are used in Length Extension Attack (in bad designed crypto). Please note that crypto hash functions where designed without taking this vulnerability in mind (except SHA3).

Upvotes: 1

David Eisenstat
David Eisenstat

Reputation: 65508

Given a modulus M, we could take Hash(X) = X mod M. Then

Hash(Hash(X) + Y) = ((X mod M) + Y) mod M = (X + Y) mod M = Hash(X + Y).

This isn't a great hash function, but, unlike the other proposals so far for Hash, it's not completely useless.

It's also essentially the only proposal, since by substituting Y = Z - Hash(X), we get

Hash(Z) = Hash(Z + (X - Hash(X))),

so Hash is invariant under adding integer multiples of X - Hash(X) to its argument and hence under adding multiples of the greatest common divisor of G of X - Hash(X) for all X. Moreover, since G divides X - Hash(X), it follows that Hash is one-to-one on the domain 0..G-1.

Upvotes: 2

Related Questions