Mikko Ohtamaa

Reputation: 83576

TimescaleDB: performance of a hypertable with append vs. midpoint inserts and indexing

I have some time-series data that I am about to import into TimescaleDB, as (time, item_id, value) tuples in a hypertable.

I have created an index:

CREATE INDEX ON time_series (item_id, time DESC);

Does TimescaleDB have different performance characteristics when inserting rows into the middle of a time series vs. appending them at the end? I know this is an issue for some native PostgreSQL data structures, like BRIN indexes.

I am asking because for some item_ids I might have patchy data, and I will need to insert those values after other item_ids have already filled the tip of the time series. Basically, some items might be old data that is seriously behind the rest of the items.

Upvotes: 3

Views: 1093

Answers (1)

eshirvana

Reputation: 24603

I don't think it behaves differently.

In your case, insert performance will depend on:

  • How many indexes do you have on that table? Are they all really needed?
  • Do those indexes contain only the minimum required columns?
  • Use parallel inserts/COPY.
  • Insert rows in batches.
  • Configure shared_buffers properly (25% of available RAM is recommended by the documentation).
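As a sketch of the batching point above — assuming a hypothetical in-memory list of `(time, item_id, value)` tuples — rows can be grouped into multi-row INSERT statements so each round trip to the server writes many rows instead of one:

```python
def batch_insert_statements(rows, batch_size=1000):
    """Group (time, item_id, value) tuples into multi-row INSERT
    statements. In real code, prefer a parameterized API such as
    psycopg2's execute_values or COPY instead of string formatting."""
    statements = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        values = ", ".join(
            "('%s', %d, %s)" % (t, item_id, value)
            for t, item_id, value in batch
        )
        statements.append(
            "INSERT INTO time_series (time, item_id, value) VALUES %s;" % values
        )
    return statements

rows = [("2023-01-01 00:00:00", 1, 0.5),
        ("2023-01-01 00:01:00", 1, 0.6),
        ("2023-01-01 00:02:00", 2, 0.7)]
# Two statements: one with two rows, one with the remaining row.
print(batch_insert_statements(rows, batch_size=2))
```

One statement with a few thousand rows typically beats thousands of single-row statements by a wide margin, because per-statement parsing and network overhead is amortized.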

But this tip is going to help you the most:

  • Write data in loose time order. When chunks are sized appropriately, the latest chunk(s) and their associated indexes are naturally maintained in memory. New rows inserted with recent timestamps will be written to these chunks and indexes already in memory.

If a row with a sufficiently older timestamp is inserted – i.e., it's an out-of-order or backfilled write – the disk pages corresponding to the older chunk (and its indexes) will need to be read in from disk. This will significantly increase write latency and lower insert throughput.

Particularly, when you are loading data for the first time, try to load data in sorted, increasing timestamp order.
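A minimal sketch of that initial-load advice, using hypothetical in-memory rows: sort by timestamp before writing, so the load only ever appends to the latest chunk instead of re-reading older chunks from disk.

```python
# Hypothetical (time, item_id, value) rows arriving in arbitrary order.
rows = [
    ("2023-01-02 00:00:00", 2, 1.5),
    ("2023-01-01 00:00:00", 1, 0.5),
    ("2023-01-03 00:00:00", 1, 2.5),
]

# ISO-8601 timestamp strings sort lexicographically in chronological
# order, so a plain sort on the first field gives increasing time order.
rows.sort(key=lambda r: r[0])
print(rows[0])  # the earliest row comes first
```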

Be careful if you're bulk loading data about many different servers, devices, and so forth:

Do not bulk insert data sequentially by server (i.e., all data for server A, then server B, then C, and so forth). This will cause disk thrashing as loading each server will walk through all chunks before starting anew.

Instead, arrange your bulk load so that data from all servers are inserted in loose timestamp order (e.g., day 1 across all servers in parallel, then day 2 across all servers in parallel, etc.)
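The interleaving described above is a k-way merge. Assuming each server's rows are already time-sorted (hypothetical data below), the standard library's `heapq.merge` yields a single stream in global timestamp order:

```python
import heapq

# Per-server rows, each list already sorted by timestamp.
server_a = [("2023-01-01", "A", 1.0), ("2023-01-02", "A", 1.1)]
server_b = [("2023-01-01", "B", 2.0), ("2023-01-02", "B", 2.1)]
server_c = [("2023-01-01", "C", 3.0), ("2023-01-02", "C", 3.1)]

# Merge the streams by timestamp: all of day 1 is emitted before any
# of day 2, instead of all of server A before server B.
merged = list(heapq.merge(server_a, server_b, server_c, key=lambda r: r[0]))
print([r[1] for r in merged])  # A, B, C, A, B, C — not A, A, B, B, C, C
```

Because `heapq.merge` is lazy, this also works when the per-server inputs are generators reading from files too large to hold in memory at once.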

Source: Timescale blog

Upvotes: 3
