John Smith
John Smith

Reputation: 33

Generating unique id for a given list/set/array of unique numbers

i have arrays containing random unique numbers from 0 to integer.max value.

How can i generate a unique id/signature(int) to identify each array uniquely rather than searching through each array and checking each number.

e.g

int[] x = {2,4,8,1,88,12....};
int[] y = {123,456,64,87,1,12...};
int[] z = {2,4,8,1...};
int[] xx = {213,3534,778,1,2,234....};
..................
..................
and so on.

Each array can have different length, but the numbers are not repeated within the array and can be repeated in other arrays. The purpose of unique id for each array is to identify it through the id so that the searching can be made fast. The arrays contain ids of components and the unique signature/id for the array will identify the components contained in it.

Also the id generated sould be same regardless of the order of the values in the array. Like {1,5} and {5,1} should generate the same id.

I have looked up different number paring methods, but the resulting number grows as the length of the array increases to a point it can not fit in an int.

The IDS assigned to components can be adjusted, they don't have to be a sequence of integers as long as a good range of numbers is available. The only requirement is that once an id is generated for an array (a collection of component ids) they should not collide. And can be generated at runtime if the collection in that array changes.

Upvotes: 2

Views: 2340

Answers (3)

rossum
rossum

Reputation: 15685

You want {1, 5} and {5, 1} to have the same ID. That rules out standard hash functions, which will give different results in that situation. One option is to sort the array before hashing. Be aware that cryptographic hashes are slow; you may find that a non-crypto hash like FNV is sufficient. It will certainly be faster.

To avoid sorting simply add all the numbers mod 2^32 or mod 2^64, as @ruakh suggests and accept that you will have a proportion of collisions. Adding in the array length will avoid some collisions: {5, 1} will not match {1, 2, 3} in that case as (2+(5+1)) != (3+(1+2+3)). You may want to test with your real data to see if this gives enough of an advantage.

Upvotes: 0

AbbeGijly
AbbeGijly

Reputation: 1211

This can be approximately solved with a hash function h() with an order-normalization function (such as sort()). A hash function is lossy, since the number of unique hashes (2^32 or 2^64) is smaller than the number of possible variable length sets of integers, resulting in a small chance of two distinct sets having the same ID (hash collision). Typically this won't be a problem if

  • you use a good hash function, and
  • your data set is not ridiculously large.

The order normalization function would ensure that sets {x, y} and {y, x} are hashed to the same value.

For the hash function you have many options, but choose a hash that minimizes collision probability, such as a cryptographic hash (SHA-256, MD5) or if you need bleeding edge performance use MurmurHash3 or other hash du jour. MurmurHash3 can produce an integer as output, while the cryptographic hashes require an extra step of extracting 4 or 8 bytes from the binary output and unpacking to an integer. (Use any consistent selection of bytes such as first or last.)

In pseudocode:

int getId(setOfInts) {
    intList = convert setOfInts to integer list
    sortedIntList = sort(intList)
    ilBytes = cast sortedIntList to byte array
    hashdigest = hash(ilBytes)
    leadingBytes = extract 4 or 8 leading bytes of hashdigest
    idInt = cast leadingBytes to integer
    return idInt
}

Upvotes: 3

ruakh
ruakh

Reputation: 183240

Strictly speaking, what you ask for is not possible: even with arrays of just two elements, there are many more possible arrays (about 261 after ignoring order) than possible signatures (232). And your arrays aren't limited to two elements, so your situation is exponentially worse.

However, if you can accept a low rate of duplicates and false matches, a simple approach is to just add together all of the elements with the + operator (which essentially computes the sum modulo 232). This is the approach taken by java.util.Set<Integer>'s hashCode() method. It doesn't completely eliminate the need to compare whole arrays (because you'll need to detect false matches), but it will radically reduce the number of such comparisons (because very few arrays will match any given array).

Upvotes: 0

Related Questions