Simon

Reputation: 439

Order-Preserving Hash Function

Is there any hash function that produces unique hash codes (like MD5) and is also order preserving?

NOTE: I don't care about security. I need it for sorting: I have a lot of chunks (~1 MB each) that I want to sort. Of course I could use an index sort, but I want to reduce the comparison time.

Theoretically: if I have 1'000'000 chunks of 1 MB (1'048'576 bytes) each, and all of them differ only in the last 10 bytes, then comparing one chunk to another takes O(k - 10) time (where k is the chunk size). If I use QuickSort (which makes about n*log2(n) comparisons), the total comparison cost is n*log2(n)*(k - 10), i.e. 1'000'000 * 20 * (1'048'576 - 10).

That's why I want to generate order-preserving hash codes of fixed size (for example 16 bytes) once, then sort the chunks by those codes and save the result (for example, to a file).
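For example, roughly this workflow (a rough sketch in Python; hash16 is just a placeholder for whatever 16-byte order-preserving key might exist, and the output file name is arbitrary):

    def hash16(chunk: bytes) -> bytes:
        # Placeholder: whatever 16-byte order-preserving key turns out to be
        # possible (the answers below discuss what is achievable).
        return chunk[:16]

    def sort_once(chunks):
        # Compute each key exactly once, then sort by (key, chunk) so that a
        # full chunk comparison only happens when two keys collide.
        order = sorted(range(len(chunks)),
                       key=lambda i: (hash16(chunks[i]), chunks[i]))
        with open("sorted_order.txt", "w") as f:   # save the result to a file
            f.write("\n".join(map(str, order)))
        return order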

Upvotes: 14

Views: 15458

Answers (6)

Goswin von Brederlow

Reputation: 12332

Let's construct such a function from the requirements:

  1. You want a function that outputs a 16-byte hash, so you will have collisions. You can't preserve perfect order, and you don't want to. The best you can do is:

    H(x) < H(y) => x < y

    H(x) > H(y) => x > y

Values close to each other will have the same hash.

  2. For each x there is an i_x > 0 so that H(x) = H(x + i_x) < H(x + i_x + 1). (Except for the end, where x + i_x + 1 would overflow your 1MB chunks.)

Extending that you get: H(x) < H(x + i_x + n) for any n > 0.

The same argument works for j_x > 0 in the other direction. Combining the two you get:

H(x - j_x) == H(x - j_x + 1) == ... == H(x + i_x - 1) == H(x + i_x)

Or in other words: for each hash value there is a single segment [a, b] of inputs mapping to that value. No value outside this segment can have the same hash value, or the ordering would be violated.

Your hash function can then be described by the segments you choose:

Let a_i be 1MB chunks with 0 <= i < 256^16 and a_i <= a_(i+1). Then

H(x) = i where a_i <= x < a_(i+1)

  3. You want a more or less uniform distribution of hash values. Otherwise one hash value would get far more collisions than another, and you would spend all your time doing full compares whenever that value is hit. So all the segments [a, b] should be about the same size.

The only way to have exactly the same size for each segment is to have

a_i = i * 256^(1'048'576 - 16)

or in other words: H(x) = first 16 bytes of x.

Any other order preserving hash function with a 16 byte output would be less efficient for a random set of input blocks.
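As a small illustration of that conclusion (a sketch only; the example values are made up), the hash is literally a 16-byte slice, and order preservation follows from lexicographic byte comparison:

    def H(x: bytes) -> bytes:
        # The hash the argument arrives at: the first 16 bytes of the chunk.
        return x[:16]

    # H(x) < H(y) implies x < y, because bytes compare lexicographically.
    # H(x) == H(y) only means the chunks share their first 16 bytes, so a
    # full comparison is still needed to break the tie.
    x = b"a" * 16 + b"zzz"
    y = b"b" * 16 + b"aaa"
    assert H(x) < H(y) and x < y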

And yes, if all but the last few bits of each input block are the same, then every test will be a collision. That's a worst-case scenario that always exists. If you know your inputs aren't uniformly random, then you can adjust the size of each segment so that each has the same probability of being hit. But that requires knowledge of the likely inputs.

Note: If you really need to sort 1'000'000 1MB chunks and you fear such a worst case, then you can use bucket sort, resulting in 1'000'000 * 1'048'576 byte compares every time. Half of that if you compare 16-bit values at a time, which still gives a reasonable number of buckets (65536).
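A rough Python sketch of that bucket sort (illustration only; it assumes equal-length chunks and is written iteratively so the number of passes is not limited by recursion depth):

    def bucket_sort(chunks):
        # MSD bucket sort on 16-bit digits: repeatedly distribute each group
        # of still-unordered chunks over 65536 buckets keyed by the next two
        # bytes, smallest bucket first.  Assumes all chunks have equal length.
        chunk_size = len(chunks[0]) if chunks else 0
        result = []
        stack = [(chunks, 0)]                    # (group of chunks, byte offset)
        while stack:
            group, pos = stack.pop()
            if len(group) <= 1 or pos >= chunk_size:
                result.extend(group)             # this group is fully ordered
                continue
            buckets = [[] for _ in range(65536)]
            for c in group:
                buckets[int.from_bytes(c[pos:pos + 2], "big")].append(c)
            for b in reversed(buckets):          # push in reverse so bucket 0 pops first
                if b:
                    stack.append((b, pos + 2))
        return result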

Upvotes: 2

Brian Long

Reputation: 316

CHM (Z.J. Czech, G. Havas, and B.S. Majewski) is an algorithm which generates a minimal perfect hash that preserves ordering (e.g. if A < B, then h(A) < h(B)). It uses approximately 8 bytes of storage per key.

See: http://cmph.sourceforge.net/chm.html

Upvotes: 20

L. Blanc

Reputation: 2310

According to NIST (I'm no expert) a Pearson hash can be order-preserving. The hash uses an auxiliary table. Such a table can (in theory) be constructed such that the resulting hash is order preserving.

It doesn't meet your full requirements though, because it doesn't reduce the size as you would like. I'm posting this in case other people are looking for a solution.
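For reference, a minimal Python sketch of the basic Pearson mechanism (illustration only; the table here is a random permutation, so this is the ordinary, non-order-preserving variant; constructing a table that makes it order preserving is not shown):

    import random

    # The auxiliary table: a permutation of 0..255.  An order-preserving
    # variant would need this table constructed specially; a random shuffle
    # only gives an ordinary Pearson hash.
    random.seed(0)
    TABLE = list(range(256))
    random.shuffle(TABLE)

    def pearson_hash(data: bytes) -> int:
        h = 0
        for byte in data:
            h = TABLE[h ^ byte]
        return h        # a single byte (0..255)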

Some pointers:

Upvotes: 4

Gassa

Reputation: 8844

Sorting an array of N strings each of length K can be done in just O(NK) or O(N^2 + NK) character comparisons.

For example, construct a trie.
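A toy sketch of the trie approach in Python (illustration only; a dict-of-dicts trie like this is far too memory-hungry for real 1MB chunks, but it shows the O(NK) idea for equal-length byte strings):

    def trie_sort(strings):
        # Insert every string into a 256-way trie (dict keyed by byte value),
        # collecting duplicates in a list under the key None at the leaf.
        root = {}
        for s in strings:
            node = root
            for byte in s:
                node = node.setdefault(byte, {})
            node.setdefault(None, []).append(s)
        # Walk the trie in key order without recursion (the strings may be long).
        out, stack = [], [root]
        while stack:
            node = stack.pop()
            if isinstance(node, list):           # a leaf bucket of equal strings
                out.extend(node)
                continue
            for byte in sorted((k for k in node if k is not None), reverse=True):
                stack.append(node[byte])
            if None in node:
                stack.append(node[None])         # strings ending here come first
        return out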

Or do a kind of insertion sort. Construct the set of sorted strings S by adding strings to it one by one. For each new string P, traverse it, maintaining the (non-decreasing) index of the greatest string Q in S such that Q <= P. When the string P ends, insert it into S just after Q. Each of the O(N) insertions can be done in O(N + K) operations: at most O(N) increases of the index, spread over the K characters of P.


When you have the indices of the strings in sorted order, just use them for your purposes instead of the "hashes" you wanted.

Upvotes: 2

mihai.ciorobea

Reputation: 741

In theory there is no such thing. If you want, you can create a composite hash:

index:md5

I think this will meet your needs.
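One possible reading of that suggestion, sketched in Python (illustration only; note that computing the index already requires the very sort the question was trying to speed up):

    import hashlib

    def composed_keys(chunks):
        # "index:md5": prefix each chunk's MD5 digest with the chunk's rank
        # in sorted order.  The 8-byte rank makes the keys order preserving;
        # the MD5 part keeps keys for distinct chunks distinct.
        order = sorted(range(len(chunks)), key=lambda i: chunks[i])
        keys = [b""] * len(chunks)
        for rank, i in enumerate(order):
            keys[i] = rank.to_bytes(8, "big") + hashlib.md5(chunks[i]).digest()
        return keys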

Upvotes: -3

Gassa

Reputation: 8844

In the general case, such a function is impossible unless the size of the hash is at least the size of the object.

The argument is trivial: if there are N possible objects but only M < N hash values, then by the pigeonhole principle two different objects are mapped to the same hash value, and so their order is not preserved.

If, however, additional properties of the objects are guaranteed, or the requirements are relaxed, a custom or probabilistic solution may become possible.

Upvotes: 6
