Reputation: 41
It seems to me this question may not have a precise answer, since it would require quite a complex analysis and a deep dive into the details of our system.
We have a distributed network of sensors. The information is gathered into one database and further processed.
The current DB design is one huge table partitioned per month. We try to keep it at about 1 billion rows (usually 600-800 million records), so the fill rate is 20-50 million records per day.
The DB server is currently MS SQL 2008 R2, but we started with 2005 and upgraded during project development.
The table itself contains SensorId, MessageTypeId, ReceiveDate and Data fields. The current solution is to keep the sensor data in the Data field (binary, 16 byte fixed length), partially decoding its type and storing it in MessageTypeId.
We have different kinds of message types sent by the sensors (currently approx. 200), and the number can be further increased on demand.
The main processing is done on an application server, which fetches records on demand (by type, SensorId and date range), decodes them and carries out the required processing. The current speed is sufficient for this amount of data.
We have a request to increase the capacity of our system by 10-20 times, and we worry whether our current solution is capable of that.
We also have two ideas to "optimise" the structure which I want to discuss.
1. Sensor data can be split into types; I'll use the two primary ones for simplicity: value (level) data, i.e. analog data with a range of values, and state data (a fixed set of values).
So we could redesign our table into a bunch of small ones using the following rules:
for each fixed type value (state type), create its own table with SensorId and ReceiveDate (so we avoid storing the type and the binary blob); all dependent (extended) states will be stored in their own tables in a similar way, like a foreign key. So if we have a State with values A and B, and dependent (or additional) states 1 and 2 for it, we end up with tables StateA_1, StateA_2, StateB_1, StateB_2. The table name consists of the fixed states it represents.
for each kind of analog data we create a separate table; it will be similar to the first type but contain an additional field with the sensor value (see the sketch after the pros/cons below).
Pros:
Cons:
The cons probably outweigh the few pros, but if we got significant performance gains and/or (less preferred but still valuable) storage space savings, maybe we would go that way.
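For illustration, a minimal sketch of what one per-state table and one per-analog-value table from idea 1 might look like (all table names here are hypothetical, made up for the example):

-- hypothetical per-state table: a row means "sensor was in state A with sub-state 1 at this time"
CREATE TABLE [StateA_1](
[sensorId] [int] NOT NULL,
[receiveDate] [datetime] NOT NULL
)

-- hypothetical per-analog-value table: same, plus the measured value
CREATE TABLE [LevelValueX](
[sensorId] [int] NOT NULL,
[receiveDate] [datetime] NOT NULL,
[value] [float] NOT NULL
)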
2. Maybe just split the table per sensor (that would be about 100,000 tables), or better by sensor range, and/or move to different databases with dedicated servers, but we want to avoid spreading across more hardware if possible.
3. Leave it as it is.
4. Switch to a different kind of DBMS, e.g. a column-oriented DBMS (HBase and similar).
What do you think? Maybe you can suggest resources for further reading?
Update: The nature of the system is that some data from sensors can arrive with up to a month's delay (usually a 1-2 week delay), some sensors are always online, and some kinds of sensors have on-board memory and only go online eventually. Each sensor message has an associated event-raised date and a server-received date, so we can distinguish recent data from data gathered some time ago. The processing includes some statistical calculations, parameter deviation detection, etc. We build aggregated reports for quick viewing, but when data from a sensor updates old data (already processed) we have to rebuild some reports from scratch, since they depend on all the available data and the aggregated values can't be reused. So we usually keep 3 months of data for quick access and archive the rest. We tried hard to reduce the amount of data we need to store, but decided that we need all of it to keep the results accurate.
Update2:
Here is the table with the primary data. As I mentioned in the comments, we removed all dependencies and constraints from it during the "need for speed" phase, so it is used for storage only.
CREATE TABLE [Messages](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[sourceId] [int] NOT NULL,
[messageDate] [datetime] NOT NULL,
[serverDate] [datetime] NOT NULL,
[messageTypeId] [smallint] NOT NULL,
[data] [binary](16) NOT NULL
)
Sample data from one of servers:
id sourceId messageDate serverDate messageTypeId data
1591363304 54 2010-11-20 04:45:36.813 2010-11-20 04:45:39.813 257 0x00000000000000D2ED6F42DDA2F24100
1588602646 195 2010-11-19 10:07:21.247 2010-11-19 10:08:05.993 258 0x02C4ADFB080000CFD6AC00FBFBFBFB4D
1588607651 195 2010-11-19 10:09:43.150 2010-11-19 10:09:43.150 258 0x02E4AD1B280000CCD2A9001B1B1B1B77
Upvotes: 4
Views: 1757
Reputation: 1557
You could store the datetime stamps as integers. I believe datetime stamps use 8 bytes and integers only use 4 within SQL. You'd have to leave off the year, but since you are partitioning by month it might not be a problem.
So '12/25/2010 23:22:59' would get stored as 1225232259 (MMDDHHMMSS).
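A minimal sketch of that conversion in T-SQL, using the Messages table from the question (the MMDDHHMMSS encoding is the one described above):

-- encode messageDate as an MMDDHHMMSS integer (year dropped; max value 1231235959 fits in int)
SELECT
    MONTH(messageDate) * 100000000
  + DAY(messageDate) * 1000000
  + DATEPART(HOUR, messageDate) * 10000
  + DATEPART(MINUTE, messageDate) * 100
  + DATEPART(SECOND, messageDate) AS messageDateCode
FROM Messages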
Just a thought...
Upvotes: 0
Reputation: 12833
Storage-wise you are probably going to be fine. SQL Server will handle it.
What worries me is the load your server is going to take. If you are receiving transactions constantly, you would have some ~400 transactions per second today. Increase this by a factor of 20 and you are looking at ~8,000 transactions per second. That's not a small number considering you are doing reporting on the same data...
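The rough arithmetic behind those figures, assuming ~35 million rows/day (the middle of the stated 20-50 million range):

35,000,000 rows/day / 86,400 s/day ≈ 405 rows/s today
405 rows/s * 20 ≈ 8,100 rows/s after the increase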
Btw, do I understand you correctly in that you are discarding the sensor data when you have processed it? So your total data set will be a "rolling" 1 billion rows? Or do you just append the data?
Upvotes: 1
Reputation: 22187
One technique, not so much related to databases, is to switch to recording only changes in values -- with a minimum of n records per minute or so. So, for example, if sensor no. 1 is sending something like:
Id Date Value
-----------------------------
1 2010-10-12 11:15:00 100
1 2010-10-12 11:15:02 100
1 2010-10-12 11:15:03 100
1 2010-10-12 11:15:04 105
then only the first and the last record would end up in the DB. To make sure that the sensor is "live", a minimum of 3 records per minute would still be entered. This way the volume of data would be reduced.
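A minimal sketch of how the "record only on change" rule could look in T-SQL (the SensorReadings table and the variables are hypothetical, patterned on the sample above; the "3 records per minute" heartbeat would be handled separately):

-- insert the new reading only if it differs from the sensor's latest stored value
-- (or if the sensor has no stored readings yet)
INSERT INTO SensorReadings (SensorId, ReadDate, Value)
SELECT @SensorId, @ReadDate, @Value
WHERE NOT EXISTS (
    SELECT 1
    FROM (SELECT TOP (1) Value
          FROM SensorReadings
          WHERE SensorId = @SensorId
          ORDER BY ReadDate DESC) AS prev
    WHERE prev.Value = @Value
)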
Not sure if this helps, or if it would be feasible in your application -- just an idea.
EDIT
Is it possible to archive data based on the probability of access? Would it be correct to say that old data is less likely to be accessed than new data? If so, you may want to take a look at Bill Inmon's DW 2.0 Architecture for The Next Generation of Data Warehousing, where he discusses a model for moving data through different DW zones (Interactive, Integrated, Near-Line, Archival) based on the probability of access. Access times vary from very fast (Interactive zone) to very slow (Archival). Each zone has different hardware requirements. The objective is to prevent large amounts of data from clogging the DW.
Upvotes: 1
Reputation: 147224
Just going to throw some ideas out there, hope they are useful - they're some of the things I'd be considering/thinking about/researching.
Partitioning - you mention the table is partitioned by month. Is that partitioned manually yourself, or are you making use of the partitioning functionality available in Enterprise Edition? If manual, consider using the built-in partitioning functionality to partition your data out further, which should give you increased scalability/performance. This "Partitioned Tables and Indexes" article on MSDN by Kimberly Tripp is great - a lot of great info in there, I won't do it an injustice by paraphrasing! It's worth considering this over manually creating 1 table per sensor, which could be more difficult to maintain/implement and therefore adds complexity (simple = good). Of course, this only applies if you have Enterprise Edition.
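For illustration, a minimal sketch of the built-in monthly partitioning (Enterprise Edition; the boundary dates are placeholders and everything is mapped to the PRIMARY filegroup just to keep the example short):

-- monthly partition function and scheme (only a few boundaries shown)
CREATE PARTITION FUNCTION pfMessagesByMonth (datetime)
AS RANGE RIGHT FOR VALUES ('2010-10-01', '2010-11-01', '2010-12-01');

CREATE PARTITION SCHEME psMessagesByMonth
AS PARTITION pfMessagesByMonth ALL TO ([PRIMARY]);

-- place the table (via its clustered index) on the scheme, partitioned by serverDate
CREATE CLUSTERED INDEX IX_Messages_serverDate
ON Messages (serverDate)
ON psMessagesByMonth (serverDate);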
Filtered Indexes - check out this MSDN article
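A small sketch of a filtered index on the Messages table, assuming (purely as an example) that one message type is queried much more often than the others:

-- index only the rows of one frequently queried message type
CREATE NONCLUSTERED INDEX IX_Messages_Type258
ON Messages (sourceId, serverDate)
WHERE messageTypeId = 258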
There is of course the hardware element - it goes without saying that a meaty server with oodles of RAM/fast disks etc. will play a part.
Upvotes: 1