Reputation: 13768
I have timeseries data about a number of items that I store (in this toy example) in a simple pair of tables. For now, this is done in MySQL, but if sufficiently strong reasons exist for trying to solve my problem in a different DBMS, I'd be all ears!
The ITEM table has a primary key and a single text-like column that can be thought of as a description; let's call it descr. The DATAPOINT table has a primary key and 3 other columns: a foreign key into the ITEM table (call it fk_item), a datetime I'll call timestamp, and a float value that we'll call value. Further, there is a joint uniqueness constraint on the (fk_item, timestamp) column pair (we only want one value in the DB for a given item at a given time).
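For concreteness, a minimal sketch of that schema might look like the following (the table and column names come from the description above; the exact datatypes, the surrogate id columns, and the storage engine are assumptions):

```sql
-- Sketch of the schema as described; datatypes and surrogate ids are assumptions
CREATE TABLE ITEM (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    descr TEXT,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE DATAPOINT (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    fk_item     INT UNSIGNED NOT NULL,
    `timestamp` DATETIME NOT NULL,
    value       FLOAT NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_item_ts (fk_item, `timestamp`),   -- the joint uniqueness constraint
    FOREIGN KEY (fk_item) REFERENCES ITEM (id)
) ENGINE=InnoDB;
```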
To put real numbers on it, the DATAPOINT table has about 1bn rows, which is the result of having approximately 100k rows for each of 10k distinct items.
My question is about the ability to optimize both read and write performance in this context, and the best way to enforce that uniqueness constraint.
A typical read from this DB will involve a small number of items (half a dozen?) for which we want to get all values in a given datetime range (containing approximately 1k points per item). To that end, it would be very handy to have an index on (fk_item, timestamp) and to enforce the joint uniqueness criterion on that index. The motivation behind reads of this type is: "I want to make a line graph of 2 or 3 items for this 3 year range".
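As a rough illustration, such a read might look like this (the item ids and date range are made up):

```sql
-- All values for a few items over a multi-year range (ids/dates illustrative)
SELECT fk_item, `timestamp`, value
FROM DATAPOINT
WHERE fk_item IN (17, 42, 99)
  AND `timestamp` >= '2015-01-01'
  AND `timestamp` <  '2018-01-01'
ORDER BY fk_item, `timestamp`;
```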
However, a typical write for this database would look very different. It would be an insertion of a single data point for each of several thousand items, all with the same (or a small number of) timestamps. The motivation for this kind of write can be thought of intuitively as: "I want to add yesterday's datapoint for every single item". So for writes of that sort, it would be more practical to have an index on (timestamp, fk_item), and to enforce the uniqueness restriction on that index.
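For illustration, the nightly write could be a single multi-row INSERT with one tuple per item, all sharing the same timestamp (the ids, timestamp, and values below are made up):

```sql
-- One new row per item, all with the same timestamp (values illustrative);
-- in practice there would be one tuple for each of the ~10k items
INSERT INTO DATAPOINT (fk_item, `timestamp`, value)
VALUES
    (1, '2017-06-01 00:00:00', 10.5),
    (2, '2017-06-01 00:00:00',  3.2),
    (3, '2017-06-01 00:00:00', 47.0);
```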
Importantly, for the scale of my data and hardware, neither of these indices fits entirely into RAM.
Typically, the vast majority of the writes happen in just a short time each day: i.e. at the end of each day all the data for that day gets written in a 15 minute period, and then reads occur throughout the day (but generally not during that 15 minute period).
So, from what I understand, if I build the table with the read-optimized (fk_item, timestamp) index (and put the uniqueness constraint there), then my typical reads will be nice and speedy. But I'm concerned that my writes will be slow, because we will need to update the index in a non-contiguous way. However, if I build the table with the write-optimized (timestamp, fk_item) index (and put the uniqueness constraint there), then my typical writes will be speedy but my typical reads will suffer.
Is there any way to get the best of both worlds? For example, if I build two indices, (fk_item, timestamp) and (timestamp, fk_item), and place the uniqueness constraint only on the latter of the two, will that work well? Or will writes still proceed at the "slow" speed, because even though there is a write-optimized index (to check the uniqueness constraint, for example), the read-optimized index will need to be updated on every insert, and that update will be non-contiguous?
Thanks in advance!
Upvotes: 3
Views: 158
Reputation: 142208
Short answer: (fk_item, timestamp) only.
Long answer:
As far as uniqueness goes, (fk_item, timestamp) and (timestamp, fk_item) are the same. While they both declare uniqueness equally well, they both suck at being unique: someday, a particular item will show up twice in the same second.
You did mention "yesterday". So, if the entry is really a subtotal for the day, then (fk_item, date) is reasonable.
When building an index, it is always better to have the date/time column last. This is so that WHERE fk_item = 123 AND date BETWEEN ... AND ... can use that index. Writes don't care (much) what order things are in.
What about the PRIMARY KEY? It is, by MySQL's definition, UNIQUE and an INDEX. So, if (fk_item, date) is reasonable, make it the PK. This will make queries that need to look at several rows for a specific item more efficient.
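A sketch of what that could look like, if the table were rebuilt around the composite key (column names and datatypes are assumptions; value stays a FLOAT as described in the question):

```sql
-- Rebuild sketch: the composite key is the clustered PRIMARY KEY,
-- so no surrogate id and no separate UNIQUE index are needed
CREATE TABLE DATAPOINT (
    fk_item INT UNSIGNED NOT NULL,
    day     DATE NOT NULL,
    value   FLOAT NOT NULL,
    PRIMARY KEY (fk_item, day),
    FOREIGN KEY (fk_item) REFERENCES ITEM (id)
) ENGINE=InnoDB;
```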
"I want to make a line graph of 2 or 3 items for this 3 year range". -- If that involves millions of rows, then you have designed the schema inefficiently. You need to build and maintain a Summary table of, say, daily values for each item. Then it would be hundreds, not millions, of rows -- much more viable.
Back to the INSERTs. With 10k distinct items and PRIMARY KEY(fk_item, date), there would be 10K spots in the table where the inserts occur. This is actually OK, and will be roughly the same speed as some other ordering.
The daily INSERTs are best done with either LOAD DATA INFILE or multi-row INSERTs.
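For example, loading the nightly file might look like this (the file path, delimiters, and column order are assumptions):

```sql
-- Bulk-load the nightly CSV; path, delimiters, and column order are assumptions
LOAD DATA INFILE '/var/load/datapoints_2017-06-01.csv'
INTO TABLE DATAPOINT
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(fk_item, `timestamp`, value);
```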
I am speaking from a MySQL perspective. Some, though perhaps not all, of what I say applies to other products.
PARTITIONing is a useless idea for MySQL unless you intend to purge 'old' data. (I can't speak for Postgres.)
If you insert rows randomly you may run into performance problems that are unrealistic, because your real situation will be much less "random". There will be only 10K spots where you do INSERTs today, not 1 billion. And tomorrow, it will be the 'same' 10K spots.
"how a table like this should be constructed" -- Minimize datatypes (eg, don't use an 8-byte BIGINT
for a yes/no flag); Provide the optimal PK (I suggested (item, day)
). But you must have tentative SELECTs
in order to settle on the secondary indexes. Normalize where appropriate (item_id
), but don't over-normalize (dates).
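To illustrate datatype minimization with a hypothetical flag column (is_final is made up): a yes/no flag declared as BIGINT wastes 7 bytes per row at this scale, while TINYINT(1) is enough. Likewise, ~10k item ids would fit in a 3-byte MEDIUMINT UNSIGNED, though fk_item and the ITEM primary key it references would have to change together.

```sql
-- Hypothetical flag column: 1-byte TINYINT instead of an 8-byte BIGINT
ALTER TABLE DATAPOINT
    MODIFY is_final TINYINT(1) NOT NULL DEFAULT 0;
```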
Upvotes: 1