Reputation: 956
In Data Vault 2.0 one hashes the business key and uses this hash as the primary key of the table. Link tables also use these hash keys to create relationships.
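To make that concrete, a hub in this style looks roughly like the sketch below (table and column names are just an example, not taken from any standard):

```sql
-- Sketch of a DV 2.0 style hub: the business key is hashed and the
-- hash becomes the primary key. All names here are illustrative.
CREATE TABLE hub_customer (
    hub_customer_hk CHAR(32)      NOT NULL PRIMARY KEY, -- MD5 of the business key, hex-encoded
    customer_bk     NVARCHAR(100) NOT NULL,
    load_date       DATETIME2     NOT NULL,
    record_source   NVARCHAR(50)  NOT NULL
);

-- Loading: hash the (sanitized) business key; CONVERT style 2 yields
-- the hex string without the 0x prefix.
INSERT INTO hub_customer (hub_customer_hk, customer_bk, load_date, record_source)
SELECT CONVERT(CHAR(32), HASHBYTES('MD5', UPPER(LTRIM(RTRIM(src.customer_bk)))), 2),
       src.customer_bk,
       SYSUTCDATETIME(),
       N'CRM'
FROM stg_customer AS src;
```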
My problem is that hashes are basically random, so the query optimizer cannot make any good estimates: the statistics are - of course - not usable for randomly distributed data.
So the query optimizer produces weird plans in which it often wants to sort (because it thinks there are only 4 rows to sort). Since I am surely not the first one to deal with Data Vault on SQL Server, how can this be fixed?
When the query optimizer uses an index seek or a join operator, it completely misses the row estimates and hence chooses ridiculous plans.
I have to patch the queries up with join hints and query hints such as OPTION (FORCE ORDER) to get anything usable out of them.
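For illustration, this is roughly what I end up writing; the table names are made up, but the hints are the point:

```sql
-- A typical hub/link/satellite join where I have to force the plan by hand,
-- because the row estimates on the random hash keys are useless.
SELECT s.load_date, s.customer_name
FROM hub_customer AS h
JOIN link_customer_order AS l
    ON l.hub_customer_hk = h.hub_customer_hk
JOIN sat_customer AS s
    ON s.hub_customer_hk = h.hub_customer_hk
WHERE h.customer_bk = N'C-1001'
OPTION (FORCE ORDER, HASH JOIN); -- keep the written join order, avoid the misestimated sorts
```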
What's the common approach for this?
Upvotes: 3
Views: 2479
Reputation: 744
I firmly agree with your conclusion that hashing turns data that had structure or order to it into totally random values, which makes any form of useful database statistics impossible.
I actually did some experimenting on SQL Server myself and came to the same conclusion as you did, supported by the execution plans.
That is why I firmly believe you/we should consider using the concatenated business key as a primary key INSTEAD of hashing it.
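As a sketch of what I mean (the names are mine, not prescribed anywhere), the hub would simply be keyed on the natural key:

```sql
-- Hub keyed on the concatenated/natural business key itself: the optimizer
-- keeps meaningful statistics on the key column. Names are illustrative.
CREATE TABLE hub_customer (
    customer_bk   NVARCHAR(100) NOT NULL PRIMARY KEY, -- natural (concatenated) business key
    load_date     DATETIME2     NOT NULL,
    record_source NVARCHAR(50)  NOT NULL
);
```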
Arguments that are given for hashing are in the realm of:
1. a fixed-length hash key (e.g. 32 characters for a hex-encoded MD5) is easier and faster to join on than a long concatenated business key;
2. a hash key distributes data evenly, which is supposedly what you want in a distribution/sharding key on MPP platforms.
But I have yet to see proof for argument 1: as you mention, you lose any useful statistics when joining! Furthermore, a lot of the natural business keys I know are actually much SMALLER than 32 characters... I actually asked a question related to this subject a few days ago...
Then to argument 2: in most MPP NoSQL databases (key-value, document, column family) the advice is to actually use a good NATURAL (concatenated) key as the sharding key, not a hash. Example: see this advice for Cassandra.
This is why I do not agree with the Data Vault 2.0 hashing theory: I have not seen any proof supporting it... It is one of the reasons why I am proposing a new Ensemble modeling approach @ DMZone Berlin in October.
Upvotes: 7
Reputation: 554
Personally, I would not hash the BK; if it were a compound key, I would include all the fields in the HUB. Why would the LINK table use the hash values? It should use the HUB_ID, which I would always set up as an incrementing value.
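A sketch of what I mean (all names are illustrative):

```sql
-- Hubs keyed by an incrementing surrogate; the link references HUB_IDs,
-- not hashes.
CREATE TABLE HUB_CUSTOMER (
    HUB_CUSTOMER_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- incrementing value
    CUSTOMER_BK     NVARCHAR(100)     NOT NULL UNIQUE,      -- business key kept in the hub
    LOAD_DATE       DATETIME2         NOT NULL,
    RECORD_SOURCE   NVARCHAR(50)      NOT NULL
);

CREATE TABLE HUB_ORDER (
    HUB_ORDER_ID  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ORDER_BK      NVARCHAR(100)     NOT NULL UNIQUE,
    LOAD_DATE     DATETIME2         NOT NULL,
    RECORD_SOURCE NVARCHAR(50)      NOT NULL
);

CREATE TABLE LINK_CUSTOMER_ORDER (
    HUB_CUSTOMER_ID INT NOT NULL REFERENCES HUB_CUSTOMER (HUB_CUSTOMER_ID),
    HUB_ORDER_ID    INT NOT NULL REFERENCES HUB_ORDER (HUB_ORDER_ID),
    LOAD_DATE       DATETIME2 NOT NULL,
    PRIMARY KEY (HUB_CUSTOMER_ID, HUB_ORDER_ID)
);
```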
I would, however, create and store a HASH in the SAT table, as this is the fastest way to check for changes in the ETL process: hash the values coming in and compare them to the hash on the current SAT record; there is no point in recomputing the hash on the existing record.
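A minimal sketch of that comparison, assuming the staging table already carries the incoming hash (all names are illustrative):

```sql
-- Insert only new or changed rows: compare the incoming hash with the hash
-- stored on the current SAT record instead of recomputing the latter.
INSERT INTO SAT_CUSTOMER (HUB_CUSTOMER_ID, LOAD_DATE, HASH_DIFF, CUSTOMER_NAME, EMAIL)
SELECT stg.HUB_CUSTOMER_ID,
       SYSUTCDATETIME(),
       stg.HASH_DIFF,                    -- e.g. HASHBYTES('MD5', CONCAT(CUSTOMER_NAME, '|', EMAIL))
       stg.CUSTOMER_NAME,
       stg.EMAIL
FROM STG_CUSTOMER AS stg
LEFT JOIN SAT_CUSTOMER_CURRENT AS cur    -- view holding the latest SAT row per HUB_CUSTOMER_ID
    ON cur.HUB_CUSTOMER_ID = stg.HUB_CUSTOMER_ID
WHERE cur.HUB_CUSTOMER_ID IS NULL        -- brand-new key
   OR cur.HASH_DIFF <> stg.HASH_DIFF;    -- attributes changed
```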
Upvotes: 0