Reputation: 221
I need to find the best way, in terms of performance and storage, to store a hash such as MD5. The current database is MariaDB; in the future it could be Oracle. The table will contain hundreds of millions or billions of records, and each record should include one hash value that can be used for searching. At the moment I store the hash as varchar(32), but I don't think this type is optimal. I am also considering types such as binary and char. As I understand it, for fixed-size values such as a hash it is better to use char instead of varchar and binary instead of varbinary. I have also thought about converting the hash to a number; would that be better? So what is the best way to store a hash in a database?
Upvotes: 4
Views: 11914
Reputation: 142198
MySQL/MariaDB: BINARY(16). It occupies 16 bytes, and is sufficient for MD5. Clearly you need an INDEX on the column.
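To put the 16-byte figure in context, here is a rough Python sketch comparing the hex form of an MD5 digest (what a varchar(32) column holds) with the raw form (what BINARY(16) holds). The row count and the helper name column_bytes are assumptions for illustration, and the estimate covers only the raw column data, ignoring per-row overhead, length bytes, and index size.

```python
import hashlib

# Compute one MD5 digest in both representations.
digest = hashlib.md5(b"example payload")
hex_form = digest.hexdigest()  # 32 hexadecimal characters
raw_form = digest.digest()     # 16 raw bytes

ROWS = 1_000_000_000  # assumed row count, taken from "a billion rows"


def column_bytes(width, rows=ROWS):
    """Raw bytes used by a fixed-width column (hypothetical helper)."""
    return width * rows


# Difference between storing 32 characters and 16 bytes per row, in GB.
saved_gb = (column_bytes(len(hex_form)) - column_bytes(len(raw_form))) / 1e9

print(len(hex_form), len(raw_form), saved_gb)
```

The same saving applies again to every index built on the column, since the index stores a copy of the value.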
But let me point out a design flaw in using hashes...
If you have a billion rows, but cannot cache more than a fraction of them, then any lookup is very likely to require a disk hit. This is because of the randomness of MD5 (or UUID or ...). The in-RAM cache (InnoDB's buffer pool, in the case of MySQL/MariaDB) is unlikely to have the block containing the next value you need.
Do the math. How fast can a disk block (that is not cached) be read? A little bit of arithmetic on that gives you how few reads/second you can perform. A spinning drive: 10ms --> 100 reads/sec. Multi-threading will not help. RAID striping will help, some.
Similarly, INSERTing is limited to about the same amount. Early in inserting a billion rows, things will be fast due to caching; later it will slow down to 100 rows/sec. Inserting a billion rows will take months.
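The arithmetic behind those numbers can be sketched in a few lines of Python. The 10 ms figure and the billion-row count come from the text above; the function names are made up for illustration, and the estimate assumes every insert costs one uncached disk read, which is the worst case rather than a measured result.

```python
SECONDS_PER_DAY = 86_400


def random_reads_per_second(latency_ms):
    """Uncached reads per second for a drive with the given access latency."""
    return 1000.0 / latency_ms


def days_to_insert(rows, rows_per_second):
    """Days needed to insert `rows` rows at a steady rate."""
    return rows / rows_per_second / SECONDS_PER_DAY


rate = random_reads_per_second(10.0)        # spinning drive, 10 ms per read
days = days_to_insert(1_000_000_000, rate)  # a billion rows at that rate

print(f"{rate:.0f} reads/sec, about {days:.0f} days")
```

At 100 rows/sec a billion rows works out to roughly 116 days, which is where "months" comes from.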
PARTITIONing will not improve performance.
You may need code (either in SQL or your app) to convert between whatever the function delivers and BINARY (which is similar to BLOB).
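On the application side, that conversion might look like the following Python sketch. The helper names are hypothetical; the idea is only that a 32-character hex digest and a 16-byte value are two encodings of the same hash. In SQL, MySQL/MariaDB offer UNHEX() and HEX() for the same purpose.

```python
import hashlib

MD5_BYTES = 16  # an MD5 digest is always 128 bits


def md5_hex_to_bytes(hex_digest):
    """Turn a 32-character hex digest into the 16 bytes to store in BINARY(16)."""
    raw = bytes.fromhex(hex_digest)
    if len(raw) != MD5_BYTES:
        raise ValueError("expected a 32-character MD5 hex digest")
    return raw


def md5_bytes_to_hex(raw):
    """Turn the 16 stored bytes back into a printable hex digest."""
    if len(raw) != MD5_BYTES:
        raise ValueError("expected 16 bytes")
    return raw.hex()


hex_digest = hashlib.md5(b"").hexdigest()
stored = md5_hex_to_bytes(hex_digest)
print(len(stored), md5_bytes_to_hex(stored))
```

If the hash is computed in the application anyway, calling digest() instead of hexdigest() yields the 16 bytes directly and avoids the round trip.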
Upvotes: 1
Reputation: 21043
The short answer is each datatype should be stored in the native format supported by the RDBMS.
For Oracle, that is RAW(16) for an MD5 hash.
Consider the analogy: some have decided to store DATE columns in VARCHAR format. You get database independence, but you can't use any of the functions your RDBMS provides for DATE columns.
In any case, you should carefully consider why you want a hash column in the database.
If it serves as a quick way to detect a change in a row's columns, it could save you a lot of coding and processing.
Before using a hash code as a key, try to find at least one reason why you would benefit from a hash key compared with a sequence-generated key.
Be careful not to use a hash as a key only because some source recommends it, before you have seen the described positive effect in your own implementation.
Upvotes: 3
Reputation: 167774
In Oracle, use the RAW data type for binary data up to 4000 bytes and BLOB for larger values.
If your hash function generates a number then you can use the UTL_RAW.CAST_FROM_NUMBER function to convert it to a RAW data type.
Upvotes: 0