Reputation: 135
Hi, I am doing a project on a data warehouse and I am not sure whether I am modeling it correctly. My data warehouse is not based on a business process, so I have found very little information about this kind of case.
Basically I have a lot of library files, and each library file contains many cells, each cell contains many pins, and each pin contains timing and power information. The different library files contain the same cells and pins; only the timing/power information differs.
library -> cell -> pin -> timing/power
I am interested in the cell properties (timing/power) so that I can make comparisons later.
Should I model my warehouse as a snowflake schema, where my fact table contains only foreign keys to the library dimension and the date dimension? The library dimension would then be normalized into a cell dimension, the cell dimension into a pin dimension, and the pin dimension into timing and power dimensions.
Or as a star schema, where my fact table contains foreign keys to the library, cell, pin, timing, power, and date dimensions? (A rough sketch of both options is below.)
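To make the two options concrete, here is a minimal sketch of both fact-table shapes, using Python with SQLite purely for illustration; all table and column names are my own invention:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Option 1: snowflake. The fact row carries only library and date keys;
    -- cell and pin hang off the library dimension in a normalized chain.
    CREATE TABLE dim_library (library_key INTEGER PRIMARY KEY, library_name TEXT);
    CREATE TABLE dim_cell    (cell_key INTEGER PRIMARY KEY,
                              library_key INTEGER REFERENCES dim_library,
                              cell_name TEXT);
    CREATE TABLE dim_pin     (pin_key INTEGER PRIMARY KEY,
                              cell_key INTEGER REFERENCES dim_cell,
                              pin_name TEXT);
    CREATE TABLE fact_snowflake (library_key INTEGER REFERENCES dim_library,
                                 date_key INTEGER);

    -- Option 2: star. The fact row carries a key to every dimension directly.
    CREATE TABLE fact_star (library_key INTEGER, cell_key INTEGER, pin_key INTEGER,
                            timing_key INTEGER, power_key INTEGER, date_key INTEGER);
    """)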
My concern is that my data set is very large: I have around 200 library files, each library file contains around 20k cells, each cell contains a few pins, and each pin contains a few timing and power entries. The total size is therefore roughly 200 x 20,000 x 4 x 4 = 64 million rows.
I will be constantly pumping this large data set in whenever a new version of the library files is released.
Can you give me a suggestion as to which one is better?
Edit:
Library A
  Cell A
    Pin A1
      Condition A11
        riseTimingTemplate
        fallTimingTemplate
        risePowerTemplate
        fallPowerTemplate
The hierarchy is shown above. Different libraries will contain the same cells, pins, and conditions; only the timing and power templates differ.
Let's say my fact grain would be the timing and power values of a particular cell.
So my dimensions would be library, cell, pin, condition, riseTimingTemplate, fallTimingTemplate, risePowerTemplate, and fallPowerTemplate, all linked to the fact table. Is that correct? (A sketch of this model is below.)
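In other words, something along these lines; this is only my sketch of the model described above, with made-up key names:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_library     (library_key INTEGER PRIMARY KEY, library_name TEXT);
    CREATE TABLE dim_cell        (cell_key INTEGER PRIMARY KEY, cell_name TEXT);
    CREATE TABLE dim_pin         (pin_key INTEGER PRIMARY KEY, pin_name TEXT);
    CREATE TABLE dim_condition   (condition_key INTEGER PRIMARY KEY, condition_name TEXT);
    CREATE TABLE dim_rise_timing (rise_timing_key INTEGER PRIMARY KEY, template TEXT);
    CREATE TABLE dim_fall_timing (fall_timing_key INTEGER PRIMARY KEY, template TEXT);
    CREATE TABLE dim_rise_power  (rise_power_key INTEGER PRIMARY KEY, template TEXT);
    CREATE TABLE dim_fall_power  (fall_power_key INTEGER PRIMARY KEY, template TEXT);

    -- grain: the timing/power of one pin of one cell, under one
    -- condition, in one library version
    CREATE TABLE fact_timing_power (
        library_key INTEGER, cell_key INTEGER, pin_key INTEGER, condition_key INTEGER,
        rise_timing_key INTEGER, fall_timing_key INTEGER,
        rise_power_key INTEGER, fall_power_key INTEGER,
        date_key INTEGER
    );
    """)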
Upvotes: 1
Views: 1329
Reputation: 321
This will depend heavily on the database technology you are using, your reporting requirements, and your ETL performance objectives; here are some thoughts to consider based on your situation above. The most direct part of the answer to your question is in the OLAP section below.
Inmon:
For space efficiency, ETL speed, and design simplicity, consider using an Inmon design, which is entity based. For your example, this might be one table per entity, with the facts stored directly on the table and the hierarchy linked through a simple key/FK relationship between the tables; you might have the 4 tables you mention above (a sketch follows). Link these tables with a natural key, to avoid depending on sort order or other chance-based surrogate keys in case you need to compare previously loaded data with the current load. This can also be more space efficient for facts where time is not smooth. However, this is a tradeoff: if a time series or other series must be smooth, reporting may be more difficult, since any smoothing requirements have to be worked out at query time.
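A rough sketch of what that entity-based layout could look like (Python with SQLite again; the natural-key columns are my assumptions):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- one table per entity, linked by natural keys, with the measures
    -- stored directly on the lowest-level entity
    CREATE TABLE library (library_name TEXT PRIMARY KEY, version TEXT);

    CREATE TABLE cell (
        library_name TEXT, cell_name TEXT,
        PRIMARY KEY (library_name, cell_name),
        FOREIGN KEY (library_name) REFERENCES library (library_name)
    );

    CREATE TABLE pin (
        library_name TEXT, cell_name TEXT, pin_name TEXT,
        PRIMARY KEY (library_name, cell_name, pin_name),
        FOREIGN KEY (library_name, cell_name)
            REFERENCES cell (library_name, cell_name)
    );

    CREATE TABLE timing_power (
        library_name TEXT, cell_name TEXT, pin_name TEXT, condition TEXT,
        rise_timing REAL, fall_timing REAL, rise_power REAL, fall_power REAL,
        PRIMARY KEY (library_name, cell_name, pin_name, condition),
        FOREIGN KEY (library_name, cell_name, pin_name)
            REFERENCES pin (library_name, cell_name, pin_name)
    );
    """)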
Kimball:
A good case for a Kimball model is when there are many different facts, all at the same grain. Mixing facts at different grains makes the data model more complex to build and use. In your example, facts at the cell level and facts at the power level are on different grains, assuming a 1-n relationship between the tables. Where you might simply store measurements on the entity in an Inmon design, in a Kimball design you would normally separate facts from dimensions, which creates additional tables to hold these measures (see the sketch below).
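For contrast with the Inmon sketch above, a Kimball-style layout that keeps every measure at a single grain might look roughly like this (names are my assumptions, not a recommendation of these exact tables):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- dimensions hold descriptive attributes only
    CREATE TABLE dim_library (library_key INTEGER PRIMARY KEY, library_name TEXT);
    CREATE TABLE dim_pin     (pin_key INTEGER PRIMARY KEY,
                              cell_name TEXT, pin_name TEXT);

    -- a single fact table; every measure lives at the same grain:
    -- one row per library / pin / condition
    CREATE TABLE fact_pin_timing_power (
        library_key INTEGER REFERENCES dim_library,
        pin_key     INTEGER REFERENCES dim_pin,
        condition   TEXT,
        rise_timing REAL, fall_timing REAL,
        rise_power  REAL, fall_power  REAL
    );
    """)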
OLAP:
If you are using OLAP technology for your query engine, this is a little more complex. Most engines require the data model to be a star schema, and most won't allow a snowflake definition because it risks a 'product join', which happens when you have n-1 or n-n relations between dependent dimension tables. N-n dimension tables can produce duplicate fact rows if not carefully handled, and these are very hard to troubleshoot: queries come back with values that are too high simply because those n-n dimensions were joined in.
If you can guarantee the outer dimensions are always 1-1 or 1-n (meaning 1 library to x cells, but never 2+ libraries to a cell) and your measures are recorded at the timing/power level, then a snowflake is a great way to store your data (least space), but you may still need to build 'views' so your design appears to be a star to your OLAP toolset (a few newer OLAP engines will allow these designs). Building your data as a star will take more space, but may allow you to spot n-n or n-1 incidents more easily.
Keep in mind, you have to keep the lowest-level key in the fact table if you ever plan to report on the finer grain (in this case the keys you represent in the star; your snowflake will require the same key structure in the fact). It's just a matter of space efficiency: the snowflake avoids duplicating the library information at each cell as a star would.
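As an illustration of the 'views' idea: one way to make a snowflaked dimension chain look like a single star dimension is a pre-joined view along these lines (hypothetical names):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_library (library_key INTEGER PRIMARY KEY, library_name TEXT);
    CREATE TABLE dim_cell    (cell_key INTEGER PRIMARY KEY,
                              library_key INTEGER, cell_name TEXT);
    CREATE TABLE dim_pin     (pin_key INTEGER PRIMARY KEY,
                              cell_key INTEGER, pin_name TEXT);

    -- flatten the 1-n chain so the OLAP tool sees one dimension table;
    -- only safe if each pin belongs to exactly one cell and each cell
    -- to exactly one library
    CREATE VIEW v_dim_pin_star AS
    SELECT p.pin_key, p.pin_name, c.cell_name, l.library_name
    FROM dim_pin p
    JOIN dim_cell    c ON c.cell_key    = p.cell_key
    JOIN dim_library l ON l.library_key = c.library_key;
    """)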
[Edit from comments]
Based on comments and your supplied model, a few thoughts:
Lastly, a tip for the date dimension: the PK for date can be a formatted number based on the detail level of the date/time collection (e.g. 20150101 might be the key for Jan 1, 2015). Add a second table for time if you need time of day; otherwise the date table will expand in storage very quickly. This makes the ETL build faster, and you can even skip joins to the date table for basic date data.
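For example, a key like that can be derived straight from the date during the ETL; a trivial sketch:

    from datetime import date

    def date_key(d: date) -> int:
        """Encode a date as a yyyymmdd integer key, e.g. 2015-01-01 -> 20150101."""
        return d.year * 10000 + d.month * 100 + d.day

    print(date_key(date(2015, 1, 1)))  # 20150101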
Upvotes: 0