Reputation: 49

DWH - Data warehouse - design

I am trying to design a DWH for a reporting system and would like to ask the profis about their opinion concerning the best design.

It is more complex but to keep it simple, suppose the following scenario:

Each customer has “group of cars”
There are “set of To-dos” to be checked with each customer, and
There might be “calls” related to each customer

In our DB, the data for each subject (Customer – Cars – To-dos – Calls) are stored in a Table. The Customer table has a one-to-many relationship to (cars – To-dos - Calls) tables as shown in the pic

I was thinking for the reporting system to set the following dimensions and fact tables: - DIM_Customer - DIM_Car - DIM_To-do - DIM_Call - Fact_ALL

As I mentioned it is more complex Anyway, I am stuck now since it is not clear how to aggregate my fact table --> since there are a group of Cars AND a group of To-dos AND multiple Calls

DIM_Customer_ID DIM_Car_ID DIM_To-do_ID DIM_Call_ID Cars_Count To-dos_Count Calls_Count

Q1: Using a bridge table between each (car, to-do, call) dimensions and the fact table --> its grain is „per customer per group-of-cars per group-of-To-dos per group of calls, right? Is there any other easier or better solution? How to solve such a cardinality?
Q2: in the Db each Customer has a unique ID --> is it possible to use this customer_ID to connect the dimension tables and the fact table? Or this is not recommended since the dim and fact tables are now not (PK-FK) connected (customer_ID is only as a PK in the DIM_customer but as FK in the other DIMs) … I am just trying to make it as simple as possible

Q3: Is there any better or more positional DWH design

This is my first DWH design, so excuse me if there are any Silly ideas :)

Thank you

Upvotes: 0

Answers (2)

Wes H

Reputation: 4439

It seems like you're misunderstanding what a fact table should be. The fact is the measurement of the event that occurs at the grain of the event.

As a subset of your example, assuming a FactCarPurchase containing measures such as Count, Mileage, PurchaseDate, PurchasePrice slicing by DimCustomer would let you see average cars per customer or for count of cars for one customer, or the average number of miles driven by a customer, or total spent by a customer. Slicing by DimCars would enable measures such as AverageMileage/AveragePurchasePrice or CountOfCars by Make/Model.

A fact captures the measures "facts" about an event. You will usually have more than one fact table and, unless your model is very small, it is highly unlikely that a single fact will link to every dimension.

First, start by describing what you want to measure (purchase event, trip event, repair event), then determine which dimension make sense to link to the fact. Aslo, the relationship may go through multiple dimensions and facts. For example, if you want to know how much a customer spent on repairs in a year, your model may look something like this:

dCustomer <- fCarPurchase -> dCars <- fCarRepair -> dDate

Upvotes: 1

Cyrus

Reputation: 2195

As always, the driving questions should be "what do I want to measure/aggregate?" and "How do I want to slice and dice the data?". Those give you the fact tables and dimension tables respectively.

Fact tables

If you want to measure only how many calls a customer made and when required todo lists were completed, a single fact table (Fact_Event or similar) will serve you well. Add both Call and Todo keys and enter key 0 or -1 (with all the fields 'N/A' in that dimension record) for the dimension that doesn't apply.

If on the other hand the Todos have a large number of steps being completed at different times and you want to measure the progress and lead time between steps, it makes more sense to store them in a separate fact table with a whole list of Date keys that simply don't apply to Calls.

In all scenarios you would link to the Car and maybe (see next part) Customer dimensions with their two keys.

Dimension tables

In your comment, you state that each Customer appears only once, with a single group of cars. That makes them rather unconventional as customers go in this type of solution, as information on (returning) customers' purchasing choices and support demands (costs) is considered very valuable.

In this case, the least complex scenario is only having a single dimension, called Car or Car_Customer. That dimension will then contain the car details, group identification and all the relevant customer details. If your average customer has a group of 1-5 cars, their details only get duplicated that many times, which is preferable to having multiple joins.

Because there is a good hierarchy, report users will not have any trouble organizing the data by customer, car or any combination.

If you decide you have returning customers after all (even if they are always entered fresh in the source system), you can add a customer matching/merging routine in your ETL and then a separate Customer Dimension makes more sense.

Group-of-cars (or Contracts?) as a separate dimension only makes sense if customers have multiple of them and/or cars can be different groups at different times. In that case, I suggest linking it only through the fact table, adding a Group key there. Bridge tables between dimensions are possible but can get messy.

Alternative: Don't use a Dimensional model

The data you have appears to be a bad fit for a Dimensional model, which is designed with the idea that often-repeated data (like customer details and product details) goes into wide dimension tables with low rowcounts, leaving the fact tables narrow, allowing for good performance at high rowcounts.

In your case Cars seem to be the only often-repeated entities, and even that is not clear from the model you describe. It makes more sense to model your tables to resemble the source model in your first picture, even using the same unique IDs, adding extra flags, descriptions and categories needed for reporting.

Only if you want to track changes to those tables (Cars for example) over time will you need to introduce dimension logic.