Aryeh Leib Taurog

Reputation: 5598

When to save time-series data

We're collecting market data on about 30,000 financial instruments and want to keep a historical record roughly every 10 minutes. It's all saved in a PostgreSQL table. I am debating between two approaches:

"Snapshot"

Store price of all symbols every 10 minutes, with nice round timestamp.

Advantages:

- Round timestamps make it easy to retrieve the entire universe of symbols at a single point in time.
- Simpler bookkeeping: no need to track when each symbol was last stored.

Disadvantages:

- A row is written for every symbol at every snapshot, even for instruments that haven't ticked since the last one, so stale prices get duplicated.
- Illiquid instruments may have no fresh value at the snapshot time.

"Rolling Updates"

Store each symbol only when it is updated, if time elapsed since last update is longer than 10 minutes.

Advantages:

- Only actual changes are stored, so there are fewer rows and no duplicated stale prices for infrequently traded instruments.

Disadvantages:

- Timestamps are scattered across symbols, which makes reconstructing a consistent snapshot at a single point in time harder.
- Requires tracking, per symbol, how long it has been since the last stored update.

Considerations

I have been doing "Rolling Updates" and I don't see any performance problem with the queries. The table has only a single multi-column index, yet inserts still seem to be much more expensive than queries, so the approach that writes fewer rows seems better suited. Is this a reasonable approach? Are there other considerations I am missing?
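
For concreteness, here is a minimal sketch of the kind of table and multi-column index this describes; every name in it (quotes, symbol, ts, price) is an illustrative assumption, not something given in the question.

```python
# Hypothetical schema for the market-data table, using psycopg 3.
# All table and column names are assumptions for illustration.
import psycopg

with psycopg.connect("dbname=market") as conn:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS quotes (
            symbol text             NOT NULL,
            ts     timestamptz      NOT NULL,
            price  double precision NOT NULL
        )
    """)
    # A single multi-column index, as described in the question;
    # it directly serves "one symbol at one point in time" lookups.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS quotes_symbol_ts ON quotes (symbol, ts)"
    )
```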

Upvotes: 1

Views: 461

Answers (2)

PabTorre

Reputation: 3127

There are a few problems with the snapshot approach that arise from the fact that not all instruments tick every minute, especially since a universe of 30,000 instruments must include some lower-liquidity instruments that trade infrequently.

The rolling-updates approach has the opposite problem: timestamps are scattered all over the place, which can complicate queries against the data.

A third approach that combines the two works best: keep a temporary in-memory record of the "rolling update" for every instrument in your parser, and on the 10-minute mark write the latest values to the permanent table and reset the temporary records. This approach also makes it easy to keep track of Open, High, Low, Close, and Volume values.
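
A minimal sketch of that hybrid pattern, assuming ticks arrive as (symbol, price, size) events; the class names and the write_row callback are illustrative, not from the answer:

```python
class Bar:
    """Open/High/Low/Close/Volume accumulated for one symbol in one window."""

    def __init__(self, price, size):
        self.open = self.high = self.low = self.close = price
        self.volume = size

    def update(self, price, size):
        self.high = max(self.high, price)
        self.low = min(self.low, price)
        self.close = price
        self.volume += size


class HybridRecorder:
    """Keeps per-symbol state in memory; flushed on each 10-minute mark."""

    def __init__(self):
        self.bars = {}  # symbol -> Bar, the temporary "rolling" records

    def on_tick(self, symbol, price, size):
        bar = self.bars.get(symbol)
        if bar is None:
            self.bars[symbol] = Bar(price, size)
        else:
            bar.update(price, size)

    def flush(self, write_row):
        # Call from a timer on the 10-minute mark: persist the latest
        # values to the permanent table, then reset the temporary records.
        for symbol, bar in self.bars.items():
            write_row(symbol, bar.open, bar.high, bar.low,
                      bar.close, bar.volume)
        self.bars.clear()
```

One design choice this sketch leaves open: a symbol that never ticks during a window produces no row for that window. If every snapshot should be complete even for illiquid instruments, carry the previous close forward instead of clearing the record.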

Upvotes: 0

Aryeh Leib Taurog

Reputation: 5598

I'm re-implementing our feed and I'm switching from rolling updates to snapshots. It was easier to code; I don't have to keep track of when to store what. The data are loaded into a carefully indexed PostgreSQL table using binary copy, so insert performance isn't an issue; we're seeing rates of at least a few thousand records/sec, which is sufficient.
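
The answer doesn't say which driver is used; as one possibility, here is a minimal sketch of a binary-copy load with psycopg 3, against the same hypothetical quotes table as above:

```python
import datetime
import psycopg  # psycopg 3 supports binary COPY natively

now = datetime.datetime.now(datetime.timezone.utc)
rows = [("AAPL", now, 182.50), ("MSFT", now, 411.20)]  # sample data

with psycopg.connect("dbname=market") as conn:
    with conn.cursor() as cur:
        with cur.copy(
            "COPY quotes (symbol, ts, price) FROM STDIN (FORMAT BINARY)"
        ) as copy:
            # Binary format requires explicit types for correct adaptation.
            copy.set_types(["text", "timestamptz", "float8"])
            for row in rows:
                copy.write_row(row)
```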

I am not specifically using round timestamps, but that would make it even easier to retrieve an entire snapshot, should we want to. At this point, we only retrieve data for one symbol at a time, at a single point in time.

Most of the symbols we deal with change much more than once every 10 minutes, so in any case our data set doesn't reflect the frequency of change in these symbols.

Update: We've started making more extensive use of the historical data. The ease with which we can now retrieve larger blocks of data for a single point in time is a real boon.
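
To illustrate, the two retrieval shapes mentioned here, sketched against the hypothetical quotes table from above:

```python
import datetime
import psycopg

at = datetime.datetime(2016, 1, 4, 10, 0, tzinfo=datetime.timezone.utc)

POINT_LOOKUP = """
    SELECT price FROM quotes
    WHERE symbol = %s AND ts <= %s
    ORDER BY ts DESC LIMIT 1
"""

# Latest row per symbol at (or before) a given time; with exactly
# round snapshot timestamps this collapses to WHERE ts = %s.
SNAPSHOT = """
    SELECT DISTINCT ON (symbol) symbol, ts, price
    FROM quotes
    WHERE ts <= %s
    ORDER BY symbol, ts DESC
"""

with psycopg.connect("dbname=market") as conn:
    price = conn.execute(POINT_LOOKUP, ("AAPL", at)).fetchone()
    snapshot = conn.execute(SNAPSHOT, (at,)).fetchall()
```

Both query shapes are served by the single (symbol, ts) multi-column index described in the question.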

Upvotes: 0
