Reputation: 21
I would like to know how I could optimize my hash function. At the moment I use the mid-square method, but I would like the algorithm to generate a hash the way SHA-512 does, mainly to avoid collisions. Currently I have a 1/3 collision rate despite using chained lists. Also, please don't change the prototype of the function. This is my function:
int hash(char *key, int len)
{
    unsigned int sum = 0;
    unsigned int square;
    unsigned int indexe;
    for (int i = 0; i < len; i++) {
        sum += (unsigned int)key[i];
    }
    square = sum * sum;
    indexe = square % 1000000;
    if (indexe < 100000) {
        indexe += 100000;
    }
    return indexe;
}
I have already tried to optimize with other methods, but the percentage of collisions is always the same, and I don't know how to do it with the SHA-512 method. I also want the collision rate to be less than 5 percent.
Upvotes: 0
Views: 116
Reputation: 601
In this case, one approach you can try is integrating SHA-512 into the hash function to minimize collisions:
SHA-512 Approach
#include <openssl/sha.h>

int hash(char *key, int len) {
    unsigned char digest[SHA512_DIGEST_LENGTH];
    SHA512((unsigned char *)key, len, digest);
    // Fold the first 4 digest bytes into one unsigned int.
    unsigned int indexe = ((unsigned int)digest[0] << 24) | ((unsigned int)digest[1] << 16) |
                          ((unsigned int)digest[2] << 8) | digest[3];
    indexe %= 1000000;
    if (indexe < 100000) {
        indexe += 100000;
    }
    return indexe;
}
However, directly integrating SHA-512 into a hash function designed for simple keys or small hash tables might lead to inefficiencies, e.g. overhead that small tables do not need, and unevenness from truncating the digest. So, alternatively, you can also try the Fowler-Noll-Vo hash approach for lightweight tables.
Fowler-Noll-Vo hash approach
// 32-bit offset basis FNV-1a algorithm
#include <stdint.h>

#define FNV_OFFSET_BASIS 2166136261u
#define FNV_PRIME 16777619u

int hash(char *key, int len) {
    uint32_t hash = FNV_OFFSET_BASIS;
    for (int i = 0; i < len; i++) {
        hash ^= (unsigned char)key[i];
        hash *= FNV_PRIME;
    }
    int indexe = hash % 1000000;
    if (indexe < 100000) {
        indexe += 100000;
    }
    return indexe;
}
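As a quick sanity check, you could drive it with something like this (the sample keys are arbitrary placeholders, not from the question):

#include <stdio.h>
#include <string.h>

int main(void) {
    char *keys[] = { "apple", "banana", "cherry" };
    for (int i = 0; i < 3; i++) {
        // Indexes should land in [100000, 999999] and look well spread.
        printf("%s -> %d\n", keys[i], hash(keys[i], (int)strlen(keys[i])));
    }
    return 0;
}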
Upvotes: -1
Reputation: 154208
How can i optimise my hash function using SHA method?
OP is focused on the pre-hash step (e.g. optimizing SHA), yet is glossing over the impact of the mod step.
Avoid mod by a non-prime.
Common hashing involves forming a value with a set of bits, the pre-hash (called sum below), that is highly scattered and somewhat evenly distributed given various inputs.
// Generalized hash generation
size_t hash(size_t len, const char *key, size_t table_size) {
    size_t sum = 0; // or maybe a wider type.
    for (size_t i = 0; i < len; i++) {
        // foo() is a placeholder for the per-character mixing step.
        sum = foo(sum, key[i], i);
    }
    // At this point, `sum` bits should appear to have a _random_ distribution.
    return sum % table_size;
}
So even with a good pre-hash, the next step (the mod) is important too.
This issue can apply without regard to the algorithm used for the pre-hash.
If the table size were 256, then only the least-significant 8 bits of sum would contribute to the returned index.
This is OK if the pre-hash was really good, as any bits are just as random as the others.
Modding by 1000000 (0xF4240), which is divisible by 2^6, passes the least-significant 6 bits of sum straight through to the index.
Yet if the pre-hash had weaknesses, modding by a prime number instead would use all of the pre-hash's bits in forming the index, which is likely an improvement. This prime modding does not significantly weaken a good pre-hash.
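As a small illustration (the prime 1000003 is my pick for the example, chosen because it is the first prime above the question's 1000000):

#include <stddef.h>

// Mod by a prime near the intended table size, so all pre-hash bits
// influence the index even when the pre-hash has weaknesses.
#define TABLE_SIZE 1000003u   // first prime above 1000000 (example choice)

size_t index_of(size_t prehash) {
    return prehash % TABLE_SIZE;   // index in [0, TABLE_SIZE)
}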
Upvotes: 0
Reputation: 109613
Others have already mentioned that SHA-512 is overkill.
This hash would do, as a vague idea, something like:
hash(65):
    take square;       // 4225
    take middle part;  // 22
    returns 22
More realistically:
#include <stdint.h>

int hash(uint8_t n) {
    uint16_t sq = n * n;                  // e.g. 65 * 65 = 4225
    uint8_t middle = (uint8_t)(sq >> 4);  // keep the middle bits of the square
    return middle;
}
It would be nice to utilize how large the hash bucket size / array length N is.
(Because the example above yields 22, a two-digit value, it would not make sense to have an array larger than 100.)
Here N seems to be 100000.
If the hash code is uniform over a range [0, M), and the index into the hash array is then taken as % N, the distribution is no longer uniform (unless N divides M).
And then the hashing must avoid pitfalls: the character values are typically 32 to 127, and the order of the chars in the array must be significant.
You could do something like
for (int i = 0; i < len; i++) {
    unsigned int val = (unsigned int)key[i];
    sum += ((sum << 7) | val) ^ (sum & 0x7fu);
}
before taking the square to randomize.
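Put together with the question's prototype, a minimal sketch might look like this (it simply splices the mixing loop above into the original function; the shift amounts and masks are not tuned values):

int hash(char *key, int len)
{
    unsigned int sum = 0;
    // Mix in each character so that order matters before squaring.
    for (int i = 0; i < len; i++) {
        unsigned int val = (unsigned int)key[i];
        sum += ((sum << 7) | val) ^ (sum & 0x7fu);
    }
    unsigned int square = sum * sum;       // mid-square step
    unsigned int indexe = square % 1000000;
    if (indexe < 100000) {
        indexe += 100000;
    }
    return indexe;
}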
But you might rather be interested in CRC checksum algorithms.
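For reference, a table-free bit-by-bit CRC-32 fits in a few lines; this is the textbook algorithm with the standard reflected polynomial 0xEDB88320 (the function name is mine, and you would still reduce the result modulo the table size as above):

#include <stdint.h>

// Bit-by-bit CRC-32 with the standard reflected polynomial, no lookup table.
uint32_t crc32_hash(const char *key, int len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (int i = 0; i < len; i++) {
        crc ^= (unsigned char)key[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return ~crc;  // final XOR
}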
Upvotes: 0
Reputation: 1107
A good approach is to compute a robust hash (like FNV-1a or MurmurHash3) on the string, then reduce it mod your table size.
#define FNV_PRIME_32 16777619u
#define FNV_OFFSET_32 2166136261u

int hash(char *key, int len)
{
    unsigned int h = FNV_OFFSET_32;
    for (int i = 0; i < len; i++) {
        h ^= (unsigned char)key[i];
        h *= FNV_PRIME_32;
    }
    unsigned int index = h % 1000000;
    if (index < 100000) {
        index += 100000;
    }
    return index;
}
FNV-1a is very fast for short strings and tends to have fewer collisions than a naive sum or mid-square method.
Computing the full SHA-512 is likely overkill for a typical hash table, and it's much slower than FNV or MurmurHash. But if you still want to do it, you can link to an existing library like OpenSSL or a lightweight SHA implementation, then do something like this:
#include <openssl/sha.h>

int hash(char *key, int len)
{
    unsigned char digest[SHA512_DIGEST_LENGTH]; // 64 bytes
    SHA512((unsigned char *)key, len, digest);
    // Fold the first 8 digest bytes into a 64-bit value.
    unsigned long long hash64 = 0;
    for (int i = 0; i < 8; i++) {
        hash64 = (hash64 << 8) | digest[i];
    }
    unsigned int index = (unsigned int)(hash64 % 1000000ULL);
    if (index < 100000)
        index += 100000;
    return index;
}
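If you go this route, remember to link against OpenSSL's libcrypto (typically gcc file.c -lcrypto). Note that in OpenSSL 3.x the one-shot SHA512() function is deprecated in favor of the EVP interface, though it still works.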
Upvotes: 1