stovroz
stovroz

Reputation: 7065

Relations between slowly changing dimensions in a data warehouse

I’m designing a data warehouse and am struggling to work out how I should be modelling this scenario. I’ve got Users stored in a Slowly Changing Dimension Type 2 table along these lines:

UserKey UserID Language EffectiveDate ExpiryDate IsCurrent
1 1001 EN 2021-01-01 9999-12-3 Y
2 1002 EN 2021-07-31 2022-01-06 N
3 1002 FR 2022-01-06 9999-12-31 Y

And a Login fact table like:

LoginKey UserKey LoginTime
12345 2 2021-12-25 15:00
12399 3 2022-01-31 18:00

Thereby allowing us to report on logins by date by user language setting at the time, etc.

Now I have to consider that each user may have one, none, or many concurrent subscriptions, which I was thinking of modelling in a Type 1 SCD thus:

SubsKey SubsID SubsType UserKey StartDate EndDate
55501 SBP501 Premium 2 2021-08-01 2022-08-01
55502 SBB123 Bonus 3 2022-01-31 2023-01-31

Is it right for one dimension table to reference the surrogate row key of another like this, or should it rather contain the UserID natural key? It seems unwieldy for the Subs table to have different UserKeys for the two concurrent Subscriptions for the same user like this. Or perhaps, when the third row was added to the Type 2 User table, should all the existing rows in Subs with UserKey=2 have been updated to UserKey=3?

The whole thing doesn't seem to fit comfortably into the classic snowflake pattern, which usually has the one-to-many relationship pointing the other way, as might be the case were Language to be a separate dimension table say, with a one-to-many relation on User.

Edit

I'm wrestling with not only in the one-to-many example described (one user has many subscriptions) but also many-to-one relations between SCDT2 tables e.g. If the user's language was stored in a SCDT2 table, should the User dimension use reference the Language ID or the LanguageKey for Language table's current row?

Upvotes: 1

Views: 1014

Answers (3)

Basil Peace
Basil Peace

Reputation: 411

Is it right for one dimension table to reference the surrogate row key of another like this, or should it rather contain the UserID natural key

There is also a third choice: durable surrogate key. So, there are actually two questions:

  1. What to use as a durable key: natural key or durable surrogate key?
  2. What to use to reference SCD2: durable key or surrogate row key?

For #1, Kimball Group recommends using durable surrogates.
IMO this requirement is often an overkill, and I'm ok with a natural key.

On #2, I couldn't find any information in Kimball books. From this discussion, I draw a conclusion that Kimball Group is against snowflake schemas in principle and that's why they don't provide any recommendations on this.

Generally, unless there are more specific reporting/BI requirements, I recommend:

  • Store both keys, surrogate row and durable (in this case, natural). This provides flexibility for different use cases.
    This article models parent-child relationship in SCD2 the same way, they store both parent durable and surrogate row keys.

    One exception is when the referenced dimension changes faster than the referencing. For example, in your case, if Users dimension would have a volatile attribute like 'Last LoginTime', then reference it by natural key only.

  • Model a referencing dimension as SCD2 too

It seems unwieldy for the Subs table to have different UserKeys for the two concurrent Subscriptions for the same user like this

It's solved by storing natural key too.

when the third row was added to the Type 2 User table, should all the existing rows in Subs with UserKey=2 have been updated to UserKey=3?

If you consider Subscriptions as SCD1 (actual data only), then yes.
If you consider it as SCD0 (original data, no changes), then no.
If you model Subscriptions as SCD2, you add new rows instead of updating.

I'm wrestling with not only in the one-to-many example described (one user has many subscriptions) but also many-to-one relations between SCDT2 tables

You always model references from many side to one. In your example, a user doesn't actually reference multiple subscriptions. Vice versa, multiple subscriptions reference one user. So, the same logic is applicable to user languages.
However, generally, I would model Languages as SCD1 and without surrogate keys. Then there are no questions how to reference them.

On the @NickW answer, I disagree with his claims:

  • "a subscription is a fact"

    It maybe sounds like a fact. But there is not enough information to conclude this.
    A subscription has ID so it's probable that it's referenced from somewhere else, for example, payments. Then Payments would be facts, and Subscriptions would be a dimension.
    If subscriptions are facts, then this table is a time-span fact table. This type of fact tables is advanced and rare, they're not used until proven necessary.

  • "having the expiry date of one row the same as the effective date of the next row is not a good design"

    This design has both advantages and disadvantages:

    • it allows using OVERLAPS
    • but it disallows using BETWEEN
    • when used with datetime, it doesn't require subtracting one second/milli/micro/nanosecond from the end and so is simpler and error-free

    The classic book on temporal SQL by Richard Snodgrass recommends this design over closed-closed.

Upvotes: 0

Data Slugger
Data Slugger

Reputation: 111

Based on your examples, the last table looks like more close to SLCD Type 4 than Type 1.

Indeed, I agree that subscriptions might be a Fact table and have a Dimension table. Perhaps, an SLCD Type 2 can be the best option for the subscriptions dimension table but adding a flag column to set the current/active subscription with his associated effective date.

Upvotes: 0

NickW
NickW

Reputation: 9788

A subscription is a fact and so should be stored in a fact table - though you might also have a subscription dimension that holds attributes of a subscription such as its name. You relate dimensions through fact tables, so your subscription fact would have FKs to Subscription, User, Date etc dimensions.

Relating dimensions directly to each other is called snowflaking and is, generally a bad design.

BTW for an SCD2 table, having the expiry date of one row the same as the effective date of the next row is not a good design. In your example, you would need business logic to define which row was active on 2022-01-06, whereas if a row expires on 2022-01-06 and the next row starts on 2022-01-07 there can be no confusion.

Upvotes: 0

Related Questions