FatCharlie
FatCharlie

Reputation: 1

Using a source surrogate key in a Dimension Table

Fairly new to data modelling and have the scenario below and am unsure on how to proceed with the design

I have a source table with a combination of natural keys and a surrogate key called MED_key, a simple example below:

--Table_Medications

MED_Key | Medication Name | Medication Desc | Class Lvl One | Class Lvl Two
-----------------------------------------------------------------------------
759456123 | Med_ABC1 | Medication for ABC 1 - Long Descriptive Text Here | Calcium | Calcium Type A
781456113 | Med_ABC1 | Medication for ABC 1 - Long Descriptive Text Here | Calcium | Calcium Type B
689476143 | Med_ABC1 | Medication for ABC 1 - Long Descriptive Text Here| Calcium Extra | Calcium Type C
709456129 | Med_ABC1 | Medication for ABC 1 - Long Descriptive Text Here| Calcium Extra | Calcium Type D
989476125 | Med_ABC2 | Medication for ABC 2 - Long Descriptive Text Here| Vit A | Vit A
009456199 | Med_ABC3 | Medication for ABC 3 - Long Descriptive Text Here| Aspirin | Aspirin


--Table_Medication_Prescribed

Trx No | Qty | Amt | MED_Key
-----------------------------
1 | 5 | 23.45 | 781456113 
2 | 3 | 5.10 | 989476125 
3 | 9 | 87.12 | 759456123 

Table_Medication_Prescribed will be the FACT table.

Table_Medications will be the DIM table. When populating the FACT table with the Dimension Key, I feel that I might hit a slow processing time when I do a join using four VARCHAR fields (Name, Description Class Lvl 1/2). I am unsure about using the MED_Key as a "natural key" inside the DIM table, is there any rules about not doing this? Can anyone help with a best approach to this?

Upvotes: 0

Views: 1614

Answers (1)

Fastidious
Fastidious

Reputation: 1319

Here is a good quote from the Kimball Group on using Surrogate Keys in Dimensions:

A dimension table is designed with one column serving as a unique primary key. This primary key cannot be the operational system’s natural key because there will be multiple dimension rows for that natural key when changes are tracked over time. In addition, natural keys for a dimension may be created by more than one source system, and these natural keys may be incompatible or poorly administered. The DW/BI system needs to claim control of the primary keys of all dimensions; rather than using explicit natural keys or natural keys with appended dates, you should create anonymous integer primary keys for every dimension. These dimension surrogate keys are simple integers, assigned in sequence, starting with the value 1, every time a new key is needed. The date dimension is exempt from the surrogate key rule; this highly predictable and stable dimension can use a more meaningful primary key.

Key things to take away from here. The first is not so much of the poorly admin portion, but how you need to claim control of the primary keys for ALL dimensions. While you are not talking specifically about the natural key, you are also talking about the source system surrogate key. I do believe this still applies because the rule for natural keys is because the source system is controlling it. Thus the same rule applies if they are sending you a surrogate key too.

Real World Issue I Ran Into

One issue I ran into with using surrogate keys from a source system I trusted is when the source system reused the surrogate keys across clients. In meaning, another source system was created later on down the road for scaling purposes and had a X number of clients data pushed into that system. Thus, all the keys were reset for only those clients. I had conflicting surrogate keys for two separate clients going into a unified dataset (i.e.: data warehouse)...

However, you can leave that up to your best judgement too if you trust that surrogate key would remain true. But, the moment it doesn't or something changes, it will break the ETL process and you may have issues down the line. Therefore, my recommendation (or answer) is to always claim control over the primary key and generate your own surrogate key for each dimension as simple integers assigned in sequence as Kimball suggests.

Data should flow to a staging table, dimensions should be populated with that surrogate key being generated/managed by your system and then appended back to the FACT with a typical UPDATE JOIN methodology. Joins will happen and don't fret this portion of the process as we all do it.

Hope that helps

Upvotes: 4

Related Questions