Reputation: 7065
I’m designing a data warehouse and am struggling to work out how I should be modelling this scenario. I’ve got Users stored in a Slowly Changing Dimension Type 2 table along these lines:
UserKey | UserID | Language | EffectiveDate | ExpiryDate | IsCurrent |
---|---|---|---|---|---|
1 | 1001 | EN | 2021-01-01 | 9999-12-3 | Y |
2 | 1002 | EN | 2021-07-31 | 2022-01-06 | N |
3 | 1002 | FR | 2022-01-06 | 9999-12-31 | Y |
And a Login fact table like:
LoginKey | UserKey | LoginTime |
---|---|---|
12345 | 2 | 2021-12-25 15:00 |
12399 | 3 | 2022-01-31 18:00 |
Thereby allowing us to report on logins by date by user language setting at the time, etc.
Now I have to consider that each user may have one, none, or many concurrent subscriptions, which I was thinking of modelling in a Type 1 SCD thus:
SubsKey | SubsID | SubsType | UserKey | StartDate | EndDate |
---|---|---|---|---|---|
55501 | SBP501 | Premium | 2 | 2021-08-01 | 2022-08-01 |
55502 | SBB123 | Bonus | 3 | 2022-01-31 | 2023-01-31 |
Is it right for one dimension table to reference the surrogate row key of another like this, or should it rather contain the UserID natural key? It seems unwieldy for the Subs table to have different UserKeys for the two concurrent Subscriptions for the same user like this. Or perhaps, when the third row was added to the Type 2 User table, should all the existing rows in Subs with UserKey=2 have been updated to UserKey=3?
The whole thing doesn't seem to fit comfortably into the classic snowflake pattern, which usually has the one-to-many relationship pointing the other way, as might be the case were Language to be a separate dimension table say, with a one-to-many relation on User.
I'm wrestling with not only in the one-to-many example described (one user has many subscriptions) but also many-to-one relations between SCDT2 tables e.g. If the user's language was stored in a SCDT2 table, should the User dimension use reference the Language ID or the LanguageKey for Language table's current row?
Upvotes: 1
Views: 1014
Reputation: 411
Is it right for one dimension table to reference the surrogate row key of another like this, or should it rather contain the UserID natural key
There is also a third choice: durable surrogate key. So, there are actually two questions:
For #1, Kimball Group recommends using durable surrogates.
IMO this requirement is often an overkill, and I'm ok with a natural key.
On #2, I couldn't find any information in Kimball books. From this discussion, I draw a conclusion that Kimball Group is against snowflake schemas in principle and that's why they don't provide any recommendations on this.
Generally, unless there are more specific reporting/BI requirements, I recommend:
Store both keys, surrogate row and durable (in this case, natural). This provides flexibility for different use cases.
This article models parent-child relationship in SCD2 the same way, they store both parent durable and surrogate row keys.
One exception is when the referenced dimension changes faster than the referencing. For example, in your case, if Users dimension would have a volatile attribute like 'Last LoginTime', then reference it by natural key only.
Model a referencing dimension as SCD2 too
It seems unwieldy for the Subs table to have different UserKeys for the two concurrent Subscriptions for the same user like this
It's solved by storing natural key too.
when the third row was added to the Type 2 User table, should all the existing rows in Subs with UserKey=2 have been updated to UserKey=3?
If you consider Subscriptions as SCD1 (actual data only), then yes.
If you consider it as SCD0 (original data, no changes), then no.
If you model Subscriptions as SCD2, you add new rows instead of updating.
I'm wrestling with not only in the one-to-many example described (one user has many subscriptions) but also many-to-one relations between SCDT2 tables
You always model references from many side to one. In your example, a user doesn't actually reference multiple subscriptions. Vice versa, multiple subscriptions reference one user. So, the same logic is applicable to user languages.
However, generally, I would model Languages as SCD1 and without surrogate keys. Then there are no questions how to reference them.
On the @NickW answer, I disagree with his claims:
"a subscription is a fact"
It maybe sounds like a fact. But there is not enough information to conclude this.
A subscription has ID so it's probable that it's referenced from somewhere else, for example, payments. Then Payments would be facts, and Subscriptions would be a dimension.
If subscriptions are facts, then this table is a time-span fact table. This type of fact tables is advanced and rare, they're not used until proven necessary.
"having the expiry date of one row the same as the effective date of the next row is not a good design"
This design has both advantages and disadvantages:
OVERLAPS
BETWEEN
The classic book on temporal SQL by Richard Snodgrass recommends this design over closed-closed.
Upvotes: 0
Reputation: 111
Based on your examples, the last table looks like more close to SLCD Type 4 than Type 1.
Indeed, I agree that subscriptions might be a Fact table and have a Dimension table. Perhaps, an SLCD Type 2 can be the best option for the subscriptions dimension table but adding a flag column to set the current/active subscription with his associated effective date.
Upvotes: 0
Reputation: 9788
A subscription is a fact and so should be stored in a fact table - though you might also have a subscription dimension that holds attributes of a subscription such as its name. You relate dimensions through fact tables, so your subscription fact would have FKs to Subscription, User, Date etc dimensions.
Relating dimensions directly to each other is called snowflaking and is, generally a bad design.
BTW for an SCD2 table, having the expiry date of one row the same as the effective date of the next row is not a good design. In your example, you would need business logic to define which row was active on 2022-01-06, whereas if a row expires on 2022-01-06 and the next row starts on 2022-01-07 there can be no confusion.
Upvotes: 0