Riccardo Neri

Reputation: 810

Optimize database structure - Improve query speed

I'm writing new software that will work with a pretty big database. The software will collect data from many scales, each of which sends weight values to it. Each scale has a serial number.

Considering that the basic information of each row is weight_id, scale_id, weight, and timestamp: is it better to have a unique table with a column called scale_id, or a different table for each scale? Consider that we can't have too many scales: the maximum would be 16, and much more often 3-4.

Case A Unique table

Case B Table scale_1123, Table scale_2222

I'm in doubt on this matter because we expect up to 100,000,000 rows per scale every year, and the machines can run for up to 10 years, maybe more.

Finally, shall I break the tables up by month or week? Or can I keep them together?

The goal - once we have a big database - is to query a certain time range and extract statistical data for one or more scales in the shortest possible time (like average weight, standard deviation, ...).

I'm sorry for so many questions, but reading the database documentation I can't find the best answer.

Upvotes: 0

Views: 522

Answers (2)

PerformanceDBA

Reputation: 33708

Preliminaries

reading the database documentation I can't find the best answer

Database documentation informs you about how to use the product; it doesn't inform you about how to design a database. For that, you need education.

The platform is not defined yet...so we are flexible at this stage

Well, the most important advice is, get a real SQL platform (ie. one that complies with the ISO/IEC/ANSI SQL Standard), and you won't have to rewrite your code when you migrate from a pretend-sql or non-sql to a real SQL. There is a lot of freeware/shareware/vapourware around these days, all non-compliant, all making a big pretence by referring to "SQL" in the manuals, on every page, but they are frauds. They have many extras, but they lack the basics.

You always get what you pay for, so make sure that you pay something of value, in order to obtain something of value. The commercial SQLs (except Oracle) have a server architecture; they are three orders of magnitude faster than the vapourware.

Record ID

In these dark days of IT, the trend appears to be:

  • Focus on the data, especially the data values. That gives you a nice spreadsheet perspective

  • Stamp an ID field on every file. That fixes the spreadsheet in concrete.

  • Implement that in a database container in some non-SQL platform.

  • Write the app.

  • Expect any and all Relational Database capabilities from it.

Now, where exactly, in the above steps, is the use of Relational Database principles; Modelling; Design; Normalisation; etc ? On what basis can any reasonable person expect any Relational capability (let alone Integrity; Power; or Speed) from that thing ? How does that thing gain Relational capabilities ?

If you place a bag of rubbish in a box labelled "Swiss Chocolate Top Quality", it remains rubbish. The placement does not magically transform the rubbish into Swiss chocolates.

The point is two-fold:

  1. If you have not educated yourself re Relational database technology, and applied it, there is no way that the thing is going to be Relational.

  2. The result of the above steps is always a pre-1970 ISAM Record Filing System. With none of the Integrity, Power, or Speed of a Relational Database.

Now you have been reading books. Good. But the problem there is, all the books that allege to be about Relational Databases are written by people who are totally ignorant of it. Forty-five years have passed since Dr E F Codd wrote the Relational Model, and 34 years since we have had genuine RDBMS platforms; standards; methods; but 95% of the planet is still implementing pre-1970 ISAM RFS. Why ? Because that is what the books teach. Why ? Because that is all the authors actually know. They can't teach what they do not know.

You are a bright and capable guy, but you have been subverted. So that has to be corrected first. Please read this Answer. Take your time. Actually use, and experiment with, the example SQL code given.

Two summary points. If you stamp an ID on your spreadsheet perspectives of the data:

  1. It will be an RFS, not an RDb.

  2. It will cripple the modelling process.

Relational Database

But you have tagged your question with database, performance, optimisation, so I assume you want all that.

It takes Swiss milk, and the finest cocoa, to make Swiss chocolate. In order to produce a Relational Database, one must

  • Model the data, only the data, and nothing but the data. That means no reference to the usage, or the app, or the reports

  • Using Relational Data Modelling technology and Normalisation

  • Determine Relational Keys (that is what the Integrity, Power, and Speed derives from)

  • Understand that the database is a collection of facts, about the real world that the app will engage with. It is not a collection of records with fields that may be related (which is the spreadsheet perspective).

That will produce a genuine Relational Database, where any report can be written re the data, in a single SELECT command. SELECTs that you have dreamt about, as well as SELECTs that you are not capable of dreaming about, yet.

Questions

The goal - once we have a big database - is to query a certain time range and extract statistical data for one or more scales in the shortest possible time (like average weight, standard deviation, ...)

Those are just some examples of the SELECTs you can dream about, right now. None of them is complex; each of them can be produced using a single SELECT. From an RDb. From an RFS, it will take you about 10 times longer to write the code, and many iterations to get the data right. And it will require at least 10 times the hardware resources.
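To make that concrete, here is a sketch of the kind of single SELECT meant. It uses the column names assumed later in this answer and a hypothetical table name ScaleWeight (see the Solution section below); STDEV() is the T-SQL name for sample standard deviation, so adjust for your platform. An illustration only, not the definitive code:

    -- Average and standard deviation per scale, over a time range
    SELECT  serial_no,
            COUNT(*)      AS readings,
            AVG(weight)   AS avg_weight,
            STDEV(weight) AS stddev_weight
        FROM  ScaleWeight
        WHERE date_time >= '2015-01-01'
          AND date_time <  '2015-02-01'
        GROUP BY serial_no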

The method is, first we get it right. That means Relational, which is very, very fast, and can handle billions of rows. Then, if and only if necessary, we improve performance using genuine methods. It is an hysterical myth, that one can implement performance, on something that is not right.

is it better to have a unique table with a column called scale_id, or a different table for each scale?
Finally, shall I break the tables up by month or week?

Hideous.

Never "break" a table up.

You might be happy to code for it, to look in two places or 16 places for one thing, but no-one else will be. The users will be cursing you long after you have left the project. That is pre-1960 methodology. We put men on the Moon in 1969. This is 2015. We talk in GB now, not KB, not MB.

Considering that the basic information of each row is weight_id, scale_id, weight, and timestamp

Unfortunately, that isn't the data; that is the result of the above non-relational steps, which includes all kinds of superfluous non-data. We will have to model the data correctly, first, before we can make considerations.

You haven't posted the incoming data from the scales. I will assume:

    serial_no
    date_time   -- time_stamp is misleading
    weight

And that somehow, somewhere, you are going to be informed as to what is being weighed, but not through that feed.

If there is anything else that comes in via the feed, please tell me now. Items such as ScaleReset, etc, that might have to be recorded.

Case A Unique table Case B Table scale_1123, Table scale_2222

Ok, so option B appears to be one table per scale. Hideous. Can you imagine the SELECT required for Std Deviation across the scales? The people who write the books that suggest this should be put in an asylum.

Second, it is worrying about performance far too early.

Ok, in that case, the option A "unique table" (that confused me, because all relational tables are unique, they have unique rows) appears to be all scales in one table, which is more correct, except for the useless and misleading ID field.

Consider that we can't have too many scales: the maximum would be 16, and much more often 3-4.

Makes no difference whatsoever. The system might grow, you might have many customers.

I'm in doubt on this matter because we expect up to 100,000,000 rows per scale every year, and the machines can run for up to 10 years, maybe more.

Makes no difference whatsoever. The system might grow, you might have many customers. This is worrying about performance too early, before a performance issue has been determined. First we make it right, then we make it faster.

3.17 inserts per second per scale is nothing to worry about in a Relational Database. What you should be worrying about is the fact that you don't have one, that you have an RFS. That will break under the load. And then you will have to perform all sorts of acrobatics to "improve" the negative performance. Better to get the data into an RDb.

16 billion rows is no problem for a genuine SQL platform. The pretend-sqls will cack themselves at about 2 billion rows, if not earlier.
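For reference, both of those figures follow directly from your own numbers:

    100,000,000 rows/year ÷ 31,536,000 seconds/year ≈ 3.17 inserts/second per scale
    100,000,000 rows/year × 10 years × 16 scales   = 16,000,000,000 rows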

Solution

Here is the Data Model that is required.

  • Continuous Weight Data Model

  • If you are not used to the Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Read the IDEF1X Notation carefully.

  • Please check the Predicates carefully. They are very important in verifying the model. If this is not clear, please ask.

  • Each scale has a serial no. There can be no better unique row Identifier.

  • In the history table, the DateTime is the obvious component to be added to form uniqueness; it is in the data (a Key must be made up from the data). A rough DDL sketch is given after this list.

  • No ID fields are needed. If you put them in, they will be (a) superfluous (b) an additional index (c) add a burden to insert performance.

  • The Key distributes the data evenly. That means high concurrency, because concurrent inserts (3.17 per sec, times 4 to 16 scales) will be spread across the table; there is no conflict.

    • Conversely, a "PK" which is an ID field guarantees that all concurrent inserts will conflict, because they have to write to the last page. It is a guaranteed "hot spot".

    • Spreading the insert load is possible only on real SQL platforms. Use a Clustered Index on the PK.

    • However, if it is not managed properly, that will result in page splits. The method is to set a suitable FILLFACTOR, depending on how often the index is rebuilt. (eg. I rebuild the Clustered Indices once every three years, and only on the largest tables, over 50GB, using a FILLFACTOR of 80%, leaving 20% for intervening inserts. Tables smaller than that never need rebuilding after the first rebuild.)

  • The purpose of the Alternate Key is to provide instantaneous access by DateTime, across all scales, ie. your time range queries. Time range queries within one scale will use the PK index, not the AK index, and they will be instantaneous as well.

    • Conversely, if you had ID fields, separate to guaranteeing continuous conflict on the insert "hot spot", all your queries would be "jumping" all over the file.
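For orientation only, the model might be declared roughly as follows. This is a T-SQL-flavoured sketch; the table names, datatypes, and lengths are assumptions on my part, the FILLFACTOR follows the 80% mentioned above, and the authoritative definition is the linked Data Model, not this code:

    CREATE TABLE Scale (
        serial_no  CHAR(12)      NOT NULL,   -- manufacturer serial number
        CONSTRAINT PK_Scale PRIMARY KEY CLUSTERED (serial_no)
        );

    CREATE TABLE ScaleWeight (
        serial_no  CHAR(12)      NOT NULL,
        date_time  DATETIME      NOT NULL,
        weight     DECIMAL(10,3) NOT NULL,
        -- Clustered PK spreads concurrent inserts across the table
        CONSTRAINT PK_ScaleWeight PRIMARY KEY CLUSTERED (serial_no, date_time)
            WITH (FILLFACTOR = 80),          -- room for intervening inserts
        CONSTRAINT FK_ScaleWeight_Scale FOREIGN KEY (serial_no)
            REFERENCES Scale (serial_no)
        );

    -- Alternate Key: time-range access across all scales
    CREATE UNIQUE INDEX AK_ScaleWeight ON ScaleWeight (date_time, serial_no);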

Upvotes: 4

Gordon Linoff

Reputation: 1269493

Your question is actually a bit different from most questions of this genre, but I don't think that changes the answer. The generic answer is "store them all in one table and use partitioning and indexing to get the performance you need."
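By way of illustration only - SQL Server partitioning syntax is assumed here, and all names are hypothetical - "one table, partitioned and indexed" might look like this, with monthly range partitions on the measurement time and a clustered key that leads with the scale:

    CREATE PARTITION FUNCTION pf_weight_month (datetime2)
        AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01');

    CREATE PARTITION SCHEME ps_weight_month
        AS PARTITION pf_weight_month ALL TO ([PRIMARY]);

    CREATE TABLE weight_reading (
        scale_id   int           NOT NULL,
        date_time  datetime2     NOT NULL,
        weight     decimal(10,3) NOT NULL,
        CONSTRAINT pk_weight_reading
            PRIMARY KEY CLUSTERED (scale_id, date_time)
        ) ON ps_weight_month (date_time);

New monthly boundaries would be added over time with ALTER PARTITION FUNCTION ... SPLIT RANGE, and time-range queries would then touch only the relevant partitions.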

However, you are talking about 100,000,000 rows per year per scale. With 10 years and 16 scales, that is up to 16,000,000,000 rows. Including the scale id as a 4-byte integer (versus storing the data in different tables) means an additional 64 GB of storage (16,000,000,000 rows × 4 bytes). That is not trivial, but, of course, it will seem like much less in 10 years' time.

I can't answer your question (despite my biases) but here are things you should be thinking about:

  • How is the data being inserted? Is this one-at-a-time or a steady flow of 3/second per scale?
  • What are the response time goals (requirements?) for various queries?
  • What are the recency requirements for newly added data? How quickly does it need to be available?
  • How often will the queries only be looking at one scale versus multiple scales?
  • Can something dramatically change over the course of 10 years? (The answer to this question is probably "yes".) Would that be volume of data, number of scales, number of measurements per row?
  • Will queries typically be going back through the historical data bringing back lots of different rows?
  • What requirements are there for backup and recovery?
  • What permissioning requirements are there for users?

There are many possible architectures for this volume of data, besides the two that you suggest:

  • Storing each scale in a single table in a different database. More practical for isolating failures and controlling users.
  • Storing each scale in a single table in a different database, partitioned by time.
  • Storing everything in a single table.
  • Storing each scale in a separate partition in a single table.
  • Storing each scale in a separate table.
  • Storing each scale in a separate table, partitioned by time.
  • Storing each scale in one table partitioned by both scale and time.

And, I think the list goes on.

Upvotes: 1
