Reputation: 18790
We plan the EDW in the following structure.
Operational System --> Staging Area --> EDW (data vault)
According to data vault law it is required to hash the business key. Should I implement the hashing in the Staging area or in the ETL(from staging area to EDW).
How do people usually deal with adding metadata in staging area ?
Upvotes: 0
Views: 891
Reputation: 8239
When hashing in staging, it is possible to prepare your data to be split in Hubs
, Satellites
and Links
. When loading data into the staging area, you only need to calculate hashes once (e.g. CustomerHashKey
, CustomerHashDiff
, CustomerContractLinkHashKey
). Besides those hashes, a Sequence
number, a Load Date
and a Source reference
, no additional metadata is put into the staging area. This gives a bit more clarity as in what metadata is added at staging.
In the next step, loading from staging into the data warehouse simply SELECT
the source data and put it into the desired structure (e.g. HubCustomer
, SatCustomer
, LinkCustomerContract
). Since the hashes have already been generated once, no additional processing is needed when loading. Also, there is only one point where hashes are being generated - in the staging processes. All other processes downstream can use them.
If you would generate the hash from staging into the data warehouse, you might end up generating those hashes multiple times (Hub
, Satellite
(+diff), Link
) which could become a performance problem for Big Data use cases or machines with weak CPUs. Think of the HashDiff
on satellites: if the hashing happens in the staging area, no additional processing is needed when inserting into the data warehouse (which could get expensive quickly when there are a lot of columns in a table).
So I would always hash in the staging area if possible.
Upvotes: 1