Navin
Navin

Reputation: 124

Surrogate key for date dimension?

There are 2 schools of thought:

  1. Use a surrogate key, preferably in the format: YYYYMMDD as this will always be sequential.
  2. Eliminate the Date dimension surrogate key and use actual date instead.

My Questions to experts on dimension modeling are:

  1. Which design would you prefer and why?
  2. How should we handle unknown values in each of these cases? Can we simply place NULL in Fact table for unknown dates as Foreign Keys can be NULL (if not why)?
  3. If we need to partition fact table on date column, how would we achieve that in case #1?

I am inclined towards using actual dates and using NULL to represent UNKNOWN dates in fact table, as date related validation on fact tables can be done without need to look in to the dimension table.

Upvotes: 3

Views: 6131

Answers (3)

Oracle is unusual in that its Date data type includes time, and it takes 7 bytes. Other platforms usually have an independent data type for bare dates. SQL Server's date takes 3 bytes, PostgreSQL's takes 4, and DB2's takes 4 (I think).

When you use an integer in the format YYYYMMDD, you have to write additional code somewhere to make sure the values are valid dates. 20121332 is a valid integer, but it doesn't represent a valid date. You don't have to write validation code like that for a date data type.

When it comes to recording missing information in a single column, there's not much difference between the two choices. You either use dates that don't mean what the other dates mean, or you use integers that don't mean what the other integers mean. Encoding reasons in little number--where '1' might mean 'not provided yet', for example--means you have to store values that aren't valid date integers in the same column with values that are valid date integers.

But I think it makes more sense to separate the fact that some data is missing from the reason that data is missing. That means storing more data, and some people are reluctant to store more data.


I've found these two IBM Redbooks useful in designing data warehouses.

Anchor modeling is a more recent development. Most of the tables in a schema designed this way will be in 6NF.

Upvotes: 4

ConcernedOfTunbridgeWells
ConcernedOfTunbridgeWells

Reputation: 66662

In order you ask:

  1. Kimball bangs on about surrogate keys for dates, but I've yet to see a compelling technical argument in favour of doing it this way. Converting to a YYYYMMDD format means you have to convert the dates or join against the date dimension to do date arithmetic. Both of these have various ways that they can screw with query plans by .
    A datetime is 8 bytes on SQL Server and (IIRC) 7 bytes on Oracle, so it is a bit wider than an integer surrogate, but I don't see merit in this argument any more unless you have extremely large data volumes. The optimiser just treats the date as a numeric value behind the scenes.

  2. I've had requirements for 'special' values of one type or another. Depending on how you want them to sort you can use various values. In the past I've used this scheme on a number of occasions:

    • 1800-01-01 for 'previous'. This will sort at the beginning unless you need dates earlier than this.
    • 9000-01-01 for 'ongoing'. This will sort at the end.
    • 9100-01-01 for 'unknown'. This will sort at the end.
    • 9200-01-01 for 'error'. This will sort at the end.
  3. Range partitioning on any DBMS platform that supports it (which includes pretty much any mainstream RDBMS platform) will work just fine on either date or integer partition keys.

I don't recommend using NULL for unknown values in a data warehouse as it will require outer joins to use the data. This will affect query plan efficiency and puts a trap in the data for inexperienced players. NULL keys in a data warehouse are bad mojo in quite a number of ways.

Another issue with NULL key values is that most ad-hoc reporting tools do not play nicely with null keys in joins. Typically they will use inner joins, so rows with NULLs in the key column will break.

For most other dimensions you would use surrogates. This de-couples the dimensions from the source data and allows you to bring new data sources into the system without disrupting the existing data.

In some cases it may be useful to have a natural key as the dimension key. An example of this might be ISO currency codes or account numbers. In the former case the 3 letter code is small enough that the overhead of using it as a key is minimal and the coding scheme is (typically) universal across all data sources. In the latter case the codes are often numeric and short enough to fit into an integer anyway and typically universal across the organisation.

The prinicpal benefit of doing this is where you have reporting specialists using their own queries to manipulate the data. It makes the tables a bit more legible to people working directly with the data.

Upvotes: 4

Steve
Steve

Reputation: 245

1.Use surrogate key preferbly in the format of YYYYMMDD as this will always be sequential.

2.Eliminate Date dimension surrogate key and use actual date instead.

Use the first option. You use an actual date, it will use up more disk, be less efficient to index and less efficient to join. Also, if you use a date type, how would you represent 'Un-known date', 'Date not available Yet', 'Date earlier than Calendar Date Range', 'Date later than calendar date range'?

You can't use '01-01-1900' for all these differnt 'Unknown' type members of the calendar dimension. Ideally, all surrogate keys should be meaningless, otherwise it is too tempting for report users to ignore the dimensions. Dimensions add richness to dimensions, such as in the calendar dimension, the week the day belongs to, the month, the year, the day name etc etc etc. Reporters should be forced to use dimensions by anonymising surrogate keys.

You should never have nulls in the presentation layer of a data warehouse, because they will force you to do left outer joins between facts and dimensions. The whole point of a star schema is to improve query performance - doing outer joins goes directly against this principle. Always replace nulls with unknown member keys like -1, -2 etc etc (assuming your 'real' member keys start at 1 and go up from there sequentially).

Edit for Catcall

I'm suggesting using surrogate keys where the earliest date in your calender gets the key 1, the next date gets 2 and so on. In a star schema, the fact is made up of surrogate keys - all small integers or measures. You have to join to the dimension to see what date 1 represents. None of the surrogates means anything. That's how you build a data mart. How is the user supposed to remen=ber that '01-01-1900' means 'date not available' and '01-01-1901' means 'Unknown' and '01-01-1902' early date etc etc? Isn't it better to store these captions in the dimension and have surrogates -3, -2, -1 in the fact. Then the user will see the caption 'Unknown' or 'Early Date' in the report. Having a date type as a surrogate key is simply a mistake, and will limit the usefulness, efficiency and flexibility of the solution. I know this through experience.

Upvotes: 0

Related Questions