user45867
user45867

Reputation: 977

Slowly Changing Dimensions - exact SQL query implementation to retrieve correct data

I'm a bit new to BI development/ data warehousing, but am facing the old Slowly Changing Dimensions dilemma. I've read a lot about the types and theory, but have found little in terms of, what I view, would be the most common SELECT queries against these implementations.

I'll keep my example simple. Say you have four sales reasons, East, West, North, and South. You have a group of salespeople that make daily sales and (maybe once a year) get reassigned a new region.

So you'll have raw data like the following:

name; sales; revenue; date
John Smith; 10; 5400; 2015-02-17

You have data like this every day.

You may also have a dimensional table like the following, initially:

name; region
John Smith; East
Nancy Ray; West
Claire Faust; North

So the sales director wants to know the monthly sales revenue for the East region for May 2015. You would execute a query:

SELECT region, month(date), sum(revenue)
from Fact_Table inner join Dim_Table on name = name
where region = East and date between ....
[group by region, month(date)]

You get the idea. Let's ignore that I'm using natural keys instead of surrogate integer keys; I'd clearly use surrogate keys.

Now, obviously, sales people may move regions mid year. Or mid month. So you have to create a SCD type in order to run this query. To me personally, Type 2 makes the most sense. So say you implement that. Say John Smith changed from East region to West region on May 15, 2015. You implement the following table:

name; region; start_date; end_date
John Smith; East; 2015-01-01; 2015-05-15
John Smith; West; 2015-5-15; 9999-12-31

Now the sales director asks the same question. What is the total sales revenue for the East for May 2015? Or moreover, show me the totals by region by month for the whole year. How would you structure the query?

SELECT region, month(date), sum(reveneue)
from Fact_Table inner join Dim_Table
on name = name
and date between start_date and end_date
group by region, month(date)

Would that give the correct results? I guess it might --- my question may be more along the lines of --- okay now assume you have 1 million records in the Fact table ... would this inner join be grossly inefficient, or is there a faster way to achieve this result?

Would it make more sense to write the SCD (like region) directly into a 'denormalized' Fact table --- and when the dimension changes, perhaps update a week or two's worth of Fact record' regions retroactively?

Upvotes: 1

Views: 1224

Answers (1)

Ron Dunn
Ron Dunn

Reputation: 3078

Your concept is correct if your business requirement has a hierarchy of Region->Seller, as shown in your example.

The performance of your current query may be challenging, but it will be improved by the use of appropriate dimension keys and attributes.

Use a date dimension hierarchy that includes date->Month, and you'll be able to avoid the range query.

Use integer, surrogate, keys in both dimensions and your indexing performance will improve.

One million rows is tiny, you won't have performance problems on any competent DBMS :)

Upvotes: 2

Related Questions