tuxmania

Reputation: 956

Data Vault 2.0 in SQL Server

In Data Vault 2.0 one hashes the business key and uses that hash as the primary key of the table. Link tables likewise use the hash keys to create their relationships.
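For concreteness, a minimal sketch of that structure (table and column names are illustrative, not from any standard): the hub is keyed on the MD5 hash of the business key, and the link joins hubs through those same hash keys.

    -- Hub keyed on an MD5 hash (32 hex characters) of the business key
    CREATE TABLE HubCustomer (
        CustomerHashKey CHAR(32)     NOT NULL PRIMARY KEY, -- MD5 of the BK, as hex
        CustomerBK      VARCHAR(50)  NOT NULL,             -- the original business key
        LoadDate        DATETIME2    NOT NULL,
        RecordSource    VARCHAR(100) NOT NULL
    );

    -- Link relating hubs through their hash keys
    CREATE TABLE LinkCustomerOrder (
        LinkHashKey     CHAR(32) NOT NULL PRIMARY KEY,     -- MD5 of the concatenated BKs
        CustomerHashKey CHAR(32) NOT NULL
            REFERENCES HubCustomer (CustomerHashKey),
        OrderHashKey    CHAR(32) NOT NULL,                 -- would reference HubOrder
        LoadDate        DATETIME2 NOT NULL
    );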

My problem is that, because hashes are essentially random, the query optimizer cannot produce any good cardinality estimates: the statistics are, of course, not usable for randomly distributed data.

So the query optimizer picks weird plans that sort frequently (because it thinks there are only 4 rows to sort). Since I am surely not the first one to deal with Data Vault on SQL Server, how is this fixable?

When the query optimizer uses an index seek or a join operator, it completely misses the row estimate and hence chooses ridiculous plans.
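One way to see the misestimation, assuming the illustrative tables above: capture the actual execution plan and compare EstimateRows against the actual row count on the seek and join operators.

    -- Capture the actual plan XML; in SSMS, "Include Actual Execution Plan"
    -- does the same. Compare estimated vs. actual rows per operator.
    SET STATISTICS XML ON;

    SELECT h.CustomerBK, l.OrderHashKey
    FROM HubCustomer AS h
    JOIN LinkCustomerOrder AS l
        ON l.CustomerHashKey = h.CustomerHashKey;

    SET STATISTICS XML OFF;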

I have to prop the queries up with join hints and query hints such as OPTION (FORCE ORDER) to get anything usable out of them.
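Roughly what that looks like, again using the illustrative tables: force the written join order and a hash join so the bad estimates cannot push the plan into nested loops with spurious sorts.

    -- Join hint plus query hint to pin the plan shape by hand
    SELECT h.CustomerBK, l.OrderHashKey
    FROM HubCustomer AS h
    INNER HASH JOIN LinkCustomerOrder AS l
        ON l.CustomerHashKey = h.CustomerHashKey
    OPTION (FORCE ORDER);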

What's the common approach for this?

Upvotes: 3

Views: 2479

Answers (2)

Rogier Werschkull

Reputation: 744

I firmly agree with your conclusion: hashing turns data that had structure and order into something effectively random, which makes any form of useful database statistics impossible.

I actually did some experimenting on SQL Server myself and came to the same conclusion you did, supported by the explain plans.

That is why I firmly believe you/we should consider using the concatenated business key as the primary key INSTEAD of hashing it.

Arguments that are given for hashing are in the realm of:

  1. Joining on CHAR(32) (the character string of an MD5 hash) is more performant than joining on variable-length character fields
  2. Hashing reduces hotspots in your MPP cluster when writing data

But I have yet to see proof for argument 1: as you mention, you lose any useful statistics when joining! Furthermore, a lot of the natural business keys I know are actually much SMALLER than 32 characters... I actually asked a question related to this subject a few days ago...

Then to argument 2: in most MPP NoSQL databases (key-value, document, column family) the advice is to actually use a good NATURAL (concatenated) key as the sharding key, not a hash. Example: see this advice for Cassandra.

This is why I do not agree with the Data Vault 2 hashing theory: I have not seen any proof supporting it... It is one of the reasons why I am proposing a new Ensemble modeling approach at DMZone Berlin in October.

Upvotes: 7

Aaron Reese

Reputation: 554

Personally, I would not hash the BK but would include all the fields in the HUB if it is a compound key. Why would the LINK table use the hash values? It should use the HUB_ID, which I would always set up as an incrementing value.
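A sketch of what I mean (names are illustrative): the hub keeps the full compound business key as plain columns, the surrogate is an incrementing identity, and the link joins on the surrogates.

    CREATE TABLE HUB_Customer (
        HUB_ID      INT IDENTITY(1, 1) NOT NULL PRIMARY KEY, -- incrementing surrogate
        CustomerNo  VARCHAR(20) NOT NULL,                    -- compound business key,
        CompanyCode VARCHAR(10) NOT NULL,                    -- stored as plain columns
        CONSTRAINT UQ_HUB_Customer_BK UNIQUE (CustomerNo, CompanyCode)
    );

    CREATE TABLE LNK_Customer_Order (
        CustomerHubId INT NOT NULL REFERENCES HUB_Customer (HUB_ID),
        OrderHubId    INT NOT NULL, -- references a HUB_Order built the same way
        PRIMARY KEY (CustomerHubId, OrderHubId)
    );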

I would, however, create and store a hash in the SAT table, as this is the fastest way to check for changes in the ETL process: hash the values coming in and compare them to the hash on the current SAT record. There is no point in recomputing the hash on the existing record.
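An illustrative version of that change check (SAT_Customer, StagingCustomer, and the IsCurrent flag are assumed names): compute the incoming hash once, compare it with the stored one, and insert only new or changed rows.

    WITH staged AS (
        SELECT s.HUB_ID, s.Name, s.City,
               CONVERT(CHAR(32),
                       HASHBYTES('MD5', CONCAT(s.Name, '|', s.City)), 2) AS HashDiff
        FROM StagingCustomer AS s
    )
    INSERT INTO SAT_Customer (HUB_ID, LoadDate, HashDiff, Name, City)
    SELECT st.HUB_ID, SYSUTCDATETIME(), st.HashDiff, st.Name, st.City
    FROM staged AS st
    LEFT JOIN SAT_Customer AS cur
        ON cur.HUB_ID = st.HUB_ID
       AND cur.IsCurrent = 1
    WHERE cur.HUB_ID IS NULL           -- brand-new key
       OR cur.HashDiff <> st.HashDiff; -- attributes changed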

Upvotes: 0
