Reputation: 509
I have looked at existing questions similar to mine, but haven't been able to find a clear answer.
I work at a big corporation where we have a large data warehouse (billions of rows). However, it is very slow and not really suited for ad hoc analytics. We are looking into something new, but the time horizon is a few years from now, which I (and my department) can't really wait for. Therefore I have been granted a new, blank SQL Server 2014 database, where I am to store the information from our data warehouse that we are going to use often.
We are going to access this data mostly through third-party analytics tools, which don't cache data but access it directly every time you click around or add a new chart; hence we need performance to be as fast as possible, since it gets really frustrating to wait a long time every time you add a new dimension to a chart.
I am sourcing data from our data warehouse, where the structure/design is usually pretty okay. However, some things I find annoying: e.g. a customer's name is stored with a date ID, which means that if you look at one customer you will see their name changing over time. For my analytical purposes this doesn't make sense; I want to keep names (and other dimensions) constant going back in time.
Right now the data is not really split into facts and dimensions, but rather something in between. I am thinking of restructuring the data into facts and dimensions, so that e.g. the customer's name is not sitting with the financials but in a dimension table instead - that way I know I get the same name every time.
My question is this, then: Is splitting the data into facts and dimensions going to slow down performance compared to having everything in one large table with all information (columns) on all rows? Will the joins slow down my queries?
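To make the split concrete, this is roughly what I have in mind (all table and column names here are made up for illustration):

```sql
-- Dimension: one row per customer, name kept constant back in time
CREATE TABLE DimCustomer (
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    CustomerID   VARCHAR(20)  NOT NULL,          -- business key from the warehouse
    CustomerName NVARCHAR(100) NOT NULL
);

-- Fact: the financials, carrying only the key instead of the name
CREATE TABLE FactFinancials (
    MonthID     INT NOT NULL,
    CustomerKey INT NOT NULL REFERENCES DimCustomer (CustomerKey),
    Amount      DECIMAL(18, 2) NOT NULL
);

-- The analytics tool would then issue joins like this:
SELECT d.CustomerName, SUM(f.Amount) AS Amount
FROM FactFinancials AS f
JOIN DimCustomer    AS d ON d.CustomerKey = f.CustomerKey
GROUP BY d.CustomerName;
```

So the question is whether joins of this shape, against a fact table of up to ~1 billion rows, will be noticeably slower than scanning one wide denormalized table.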
I work with monthly data of 10-15 million rows, i.e. 120-180 million rows per year, for 3-6 years - a maximum of approximately 1 billion rows.
Did that make sense?
Thank you.
/Steffen.
Upvotes: 0
Views: 433
Reputation: 13959
It is better to model the data as facts and dimensions, which will help your reporting layer query faster.
That said, it highly depends on how you design the dimension tables and fact tables. The typical idea is to use an integer surrogate key in each dimension, which gives you the flexibility to handle Slowly Changing Dimensions (Type I and Type II) in the future.
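As an illustration (the table and column names are hypothetical), a dimension built for Type II changes carries validity columns, so you can either overwrite in place (Type I) or keep history (Type II) as your needs evolve:

```sql
CREATE TABLE DimCustomer (
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,  -- integer surrogate key
    CustomerID   VARCHAR(20)  NOT NULL,          -- business key
    CustomerName NVARCHAR(100) NOT NULL,
    ValidFrom    DATE NOT NULL,
    ValidTo      DATE NULL,                      -- NULL = still current
    IsCurrent    BIT  NOT NULL DEFAULT 1
);

-- Type I: overwrite in place, no history kept
UPDATE DimCustomer
SET CustomerName = 'New Name'
WHERE CustomerID = 'C123' AND IsCurrent = 1;

-- Type II: close the current row and insert a new version
UPDATE DimCustomer
SET ValidTo = '2016-06-30', IsCurrent = 0
WHERE CustomerID = 'C123' AND IsCurrent = 1;

INSERT INTO DimCustomer (CustomerID, CustomerName, ValidFrom, IsCurrent)
VALUES ('C123', 'New Name', '2016-07-01', 1);
```

For your case (names constant back in time), Type I is enough, but the surrogate key means switching to Type II later doesn't break the fact table.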
Designing the facts also matters. The majority of the problems are due to I/O, so you can consider columnstore indexes for the facts: your data will be compressed and you will get faster performance. Go through this link for a better understanding:
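For example, assuming a fact table named `FactFinancials` (your name will differ), a clustered columnstore index in SQL Server 2014 would look like this:

```sql
-- Compresses the whole fact table column-by-column and speeds up
-- scan-heavy aggregation queries typical of analytics tools.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactFinancials
ON dbo.FactFinancials;
```

At a few hundred million to a billion rows, the compression alone usually cuts I/O substantially compared to a rowstore table.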
Upvotes: 1