Reputation: 5153
I was reading an article on hash indexing, and it seems that it is similar to the md5
function of PHP, in that that both take a string value and return an integer representing that string, and this representation is consistent. Is this similarity really there, or am I missing anything? Plus has anybody got an idea about the hashing algorithm MySQL employs for hash based index structure?
Upvotes: 1
Views: 703
Reputation: 2211
I'm not pretending to give a complete description on MySQL algo, but there are a few things that may be guessed.
First of all, Hash table wiki is a must-read. Then we have a notice from MySQL documentation:
- They are used only for equality comparisons that use the = or <=> operators (but are very fast). They are not used for comparison
operators such as < that find a range of values. Systems that rely on this type of single-value lookup are known as “key-value stores”; to
use MySQL for such applications, use hash indexes wherever possible.- The optimizer cannot use a hash index to speed up ORDER BY operations. (This type of index cannot be used to search for the next entry in order.)
- MySQL cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use). This may affect some queries if you change a MyISAM table to a hash-indexed MEMORY table.
- Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
This points to following (rather common) properties:
MySQL hash function operates on a fixed length "full-key" record (it is a question though, how varchars are treated, e.g. they might be padded with zeros up to the maximum length)
There is a max_heap_table_size
global value and a MAX_ROWS
parameter that engine is likely to use when guessing upper row count for the hash function.
MySQL allows non-unique keys, but warns about proportional slowdowns. At least this may tell that there is no second hash function, but a mere linked list used in Collision resolution.
As for the actual function used, I don't think there is much to tell. MySQL may even use different functions according to some key heuristics (e.g. one for mostly sequential data, such as ID, but another for CHARs), and of course its output is changed according to estimated row count. However, you should only consider hash indices when BTREE cannot afford you good enough performance or you just never ever use any of its advantages, which is, I suppose, a rare case.
UPDATE
A bit into sources: /storage/heap/hp_hash.c
contains a few implementations for hash functions. At least it was a right assumption that they use different techniques for different types, as it comes to TEXT and VARCHAR:
/*
* Fowler/Noll/Vo hash
*
* The basis of the hash algorithm was taken from an idea sent by email to the
* IEEE Posix P1003.2 mailing list from Phong Vo ([email protected]) and
* Glenn Fowler ([email protected]). Landon Curt Noll ([email protected])
* later improved on their algorithm.
*
* The magic is in the interesting relationship between the special prime
* 16777619 (2^24 + 403) and 2^32 and 2^8.
*
* This hash produces the fewest collisions of any function that we've seen so
* far, and works well on both numbers and strings.
*/
I'll try to give a simplified explanation.
ulong nr= 1, nr2= 4;
for (seg=keydef->seg,endseg=seg+keydef->keysegs ; seg < endseg ; seg++)
Every part of a compund key is processed separately, result is accumulated in nr
.
if (seg->null_bit)
{
if (rec[seg->null_pos] & seg->null_bit)
{
nr^= (nr << 1) | 1;
continue;
}
}
NULL values are treated separately.
if (seg->type == HA_KEYTYPE_TEXT)
{
uint char_length= seg->length; /* TODO: fix to use my_charpos() */
seg->charset->coll->hash_sort(seg->charset, pos, char_length,
&nr, &nr2);
}
else if (seg->type == HA_KEYTYPE_VARTEXT1) /* Any VARCHAR segments */
{
uint pack_length= seg->bit_start;
uint length= (pack_length == 1 ? (uint) *(uchar*) pos : uint2korr(pos));
seg->charset->coll->hash_sort(seg->charset, pos+pack_length,
length, &nr, &nr2);
}
So are TEXT and VARCHAR. hash_sort
is presumably some other function that takes collation into account. VARCHARs have a prefixed 1 or 2-byte length.
else
{
uchar *end= pos+seg->length;
for ( ; pos < end ; pos++)
{
nr *=16777619;
nr ^=(uint) *pos;
}
}
And every other type is treated byte-by-byte with mutiplication and xor
.
Upvotes: 2