Hello lad
Hello lad

Reputation: 18790

Should I add hashkey in staging area or in ETL from staging area to DWH?

We plan the EDW in the following structure.

Operational System --> Staging Area --> EDW (data vault)

According to data vault law it is required to hash the business key. Should I implement the hashing in the Staging area or in the ETL(from staging area to EDW).

How do people usually deal with adding metadata in staging area ?

Upvotes: 0

Views: 891

Answers (1)

tobi6
tobi6

Reputation: 8239

  • When hashing in staging, it is possible to prepare your data to be split in Hubs, Satellites and Links. When loading data into the staging area, you only need to calculate hashes once (e.g. CustomerHashKey, CustomerHashDiff, CustomerContractLinkHashKey). Besides those hashes, a Sequence number, a Load Date and a Source reference, no additional metadata is put into the staging area. This gives a bit more clarity as in what metadata is added at staging.

  • In the next step, loading from staging into the data warehouse simply SELECT the source data and put it into the desired structure (e.g. HubCustomer, SatCustomer, LinkCustomerContract). Since the hashes have already been generated once, no additional processing is needed when loading. Also, there is only one point where hashes are being generated - in the staging processes. All other processes downstream can use them.

If you would generate the hash from staging into the data warehouse, you might end up generating those hashes multiple times (Hub, Satellite (+diff), Link) which could become a performance problem for Big Data use cases or machines with weak CPUs. Think of the HashDiff on satellites: if the hashing happens in the staging area, no additional processing is needed when inserting into the data warehouse (which could get expensive quickly when there are a lot of columns in a table).

So I would always hash in the staging area if possible.

Upvotes: 1

Related Questions