Index to query sorted values in keyed time range

Question

Suppose I have key/value/timerange tuples, e.g.:

CREATE TABLE historical_values(
  key TEXT,
  value NUMERIC,
  from_time TIMESTAMPTZ,
  to_time TIMESTAMPTZ
)

and would like to be able to efficiently query values (sorted descending) for a specific key and time, e.g.:

SELECT value
FROM historical_values
WHERE
  key = [KEY]
  AND from_time <= [TIME]
  AND to_time >= [TIME]
ORDER BY value DESC

What kind of index/types should I use to get the best lookup performance? I suspect my solution will involve a tstzrange and a gist index, but I'm not sure how to make that play well with the key matching and value ordering requirements.

Edit: Here's some more information about usage.

Ideally uses features available in Postgres v9.6.
Relation will contain approx. 1k keys and 5m values per key. Values are large integers (up to 32 bytes), mostly unique. Time ranges between few hours to a couple years. Time horizon is 5 years. No NULL values allowed, but some time ranges are open-ended (could either use NULL or a time far into the future for to_time).
The primary key is the key and time range (as there is only one historical value for a time range, per key).
Common operations are a) updating to_time to "close" a historical value, and b) inserting a new value with from_time = NOW.
All values may be queried. Partitioning is an option.

Erwin Brandstetter · Accepted Answer

DB design

For a big table like that ("1k keys and 5m values per key") I would suggest to optimize storage like:

CREATE TABLE hist_keys (
   key_id serial PRIMARY KEY
 , key text NOT NULL UNIQUE
);

CREATE TABLE hist_values (
   hist_value_id bigserial PRIMARY KEY  -- optional, see below!
 , key_id        int NOT NULL REFERENCES hist_keys
 , value         numeric
 , from_time     timestamptz NOT NULL
 , to_time       timestamptz NOT NULL
 , CONSTRAINT range_valid CHECK (from_time <= to_time)  -- or < ?
);

Also helps index performance.

And consider partitioning. List-partitioning on key_id. Maybe even add sub-partitioning on (range partitioning this time) on from_time. Read the manual here.

With one partition per key_id, (and constraint exclusion enabled!) Postgres would only look at the small partition (and index) for the given key, instead of the whole big table. Major win.

But I would strongly suggest to upgrade to at least Postgres 10 first, which added "declarative partitioning". Makes managing partition a lot easier.

Better yet, skip forward to Postgres 11 (currently beta), which adds major improvements for partitioning (incl. performance improvements). Most notably, for your goal to get the best lookup performance, quoting the chapter on partitioning in release notes for Postgres 11 (currently beta):

Allow faster partition elimination during query processing (Amit Langote, David Rowley, Dilip Kumar)

This speeds access to partitioned tables with many partitions.

Allow partition elimination during query execution (David Rowley, Beena Emerson)

Previously partition elimination could only happen at planning time, meaning many joins and prepared queries could not use partition elimination.

Index

From the perspective of the value column, the small subset of selected rows is arbitrary for every new query. I don't expect you'll find a useful way to support ORDER BY value DESC with an index. I'd concentrate on the other columns. Maybe add value as last column to each index if you can get index-only scans out of it (possible for btree and GiST).

Without partitioning:

CREATE UNIQUE INDEX hist_btree_idx ON hist_values (key_id, from_time, to_time DESC);

UNIQUE is optional, but see below.
Note the importance of opposing sort orders for from_time and to_time. See (closely related!):

Optimizing queries on a range of timestamps (two columns)

This is almost the same index as the one implementing your PK on (key_id, from_time, to_time). Unfortunately, we cannot use it as PK index. Quoting the manual:

Also, it must be a b-tree index with default sort ordering.

So I added a bigserial as surrogate primary key in my suggested table design above and NOT NULL constraints plus the UNIQUE index to enforce your uniqueness rule.

In Postgres 10 or later consider an IDENTITY column instead:

Auto increment table column

You might even do with PK constraint in this exceptional case to avoid duplicating the index and keep the table at minimum size. Depends on the complete situation. You may need it for FK constraints or similar. See:

How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?

A GiST index like you already suspected may be even faster. I suggest to keep your original timestamptz columns in the table (16 bytes instead of 32 bytes for a tstzrange) and add key_id after installing the additional module btree_gist:

CREATE INDEX hist_gist_idx ON hist_values
USING GiST (key_id, tstzrange(from_time, to_time, '[]'));

The expression tstzrange(from_time, to_time, '[]') constructs a range including upper and lower bound. Read the manual here.

Your query needs to match the index:

SELECT value
FROM   hist_values
WHERE  key = [KEY]
AND    tstzrange(from_time, to_time, '[]') @>  tstzrange([TIME_FROM], [TIME_TO], '[]') 
ORDER  BY value DESC;

It's equivalent to your original.
@> being the range contains operator.

With list-partitioning on key_id

With a separate table for each key_id, we can omit key_id from the index, improving size and performance - especially for the GiST index - for which we then also don't need the additional module btree_gist. Results in ~ 1000 partitions and the corresponding indexes:

CREATE INDEX hist999_gist_idx ON hist_values USING GiST (tstzrange(from_time, to_time, '[]'));

Store the day of the week and time?

Index to query sorted values in keyed time range

Answers (1)

DB design

Index

Related Questions