Aryeh Leib Taurog

Reputation: 5598

When to save time-series data

We're collecting market data on about 30,000 financial instruments and want to keep a historical record roughly every 10 minutes. It's all saved in a PostgreSQL table. I am debating between two approaches:

"Snapshot"

Store price of all symbols every 10 minutes, with nice round timestamp.

Advantages:

- Round timestamps make it easy to retrieve the entire universe of symbols at a single point in time.
- Simpler bookkeeping: no need to track when each symbol was last stored.

Disadvantages:

- A row is written for every symbol at every snapshot, even for instruments that haven't ticked since the last one, so stale prices get duplicated.
- Illiquid instruments may have no fresh value at the snapshot time.

"Rolling Updates"

Store each symbol only when it is updated, if time elapsed since last update is longer than 10 minutes.

Advantages:

- Only actual changes are stored, so there are fewer rows and no duplicated stale prices for infrequently traded instruments.

Disadvantages:

- Timestamps are scattered across symbols, which makes reconstructing a consistent snapshot at a single point in time harder.
- Requires tracking, per symbol, how long it has been since the last stored update.

Considerations

I have been doing "Rolling Updates" and I don't see any performance problem with the queries. The table has only a single multi-column index, yet inserts still seem to be much more expensive than queries, so the approach that writes fewer rows seems better suited. Is this a reasonable approach? Are there other considerations I am missing?
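
For concreteness, here is a minimal sketch of the kind of table and multi-column index this describes; every name in it (quotes, symbol, ts, price) is an illustrative assumption, not something given in the question.

```python
# Hypothetical schema for the market-data table, using psycopg 3.
# All table and column names are assumptions for illustration.
import psycopg

with psycopg.connect("dbname=market") as conn:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS quotes (
            symbol text             NOT NULL,
            ts     timestamptz      NOT NULL,
            price  double precision NOT NULL
        )
    """)
    # A single multi-column index, as described in the question;
    # it directly serves "one symbol at one point in time" lookups.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS quotes_symbol_ts ON quotes (symbol, ts)"
    )
```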

Upvotes: 1

Views: 461

Answers (2)

PabTorre

Reputation: 3127

There are a few problems with the snapshot approach that arise from the fact that not all instruments tick every minute, especially since a universe of 30,000 instruments must include some lower-liquidity instruments that trade infrequently.

The rolling-updates approach has the opposite problem: timestamps are scattered all over the place, which can complicate queries against the data.

A third approach that combines the two works best: keep a temporary in-memory record of the "rolling update" for every instrument in your parser, and on the 10-minute mark write the latest values to the permanent table and reset the temporary records. This approach also makes it easy to keep track of Open, High, Low, Close, and Volume values.
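
A minimal sketch of that hybrid pattern, assuming ticks arrive as (symbol, price, size) events; the class names and the write_row callback are illustrative, not from the answer:

```python
class Bar:
    """Open/High/Low/Close/Volume accumulated for one symbol in one window."""

    def __init__(self, price, size):
        self.open = self.high = self.low = self.close = price
        self.volume = size

    def update(self, price, size):
        self.high = max(self.high, price)
        self.low = min(self.low, price)
        self.close = price
        self.volume += size


class HybridRecorder:
    """Keeps per-symbol state in memory; flushed on each 10-minute mark."""

    def __init__(self):
        self.bars = {}  # symbol -> Bar, the temporary "rolling" records

    def on_tick(self, symbol, price, size):
        bar = self.bars.get(symbol)
        if bar is None:
            self.bars[symbol] = Bar(price, size)
        else:
            bar.update(price, size)

    def flush(self, write_row):
        # Call from a timer on the 10-minute mark: persist the latest
        # values to the permanent table, then reset the temporary records.
        for symbol, bar in self.bars.items():
            write_row(symbol, bar.open, bar.high, bar.low,
                      bar.close, bar.volume)
        self.bars.clear()
```

One design choice this sketch leaves open: a symbol that never ticks during a window produces no row for that window. If every snapshot should be complete even for illiquid instruments, carry the previous close forward instead of clearing the record.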

Upvotes: 0

Aryeh Leib Taurog

Reputation: 5598

I'm re-implementing our feed and I'm switching from rolling updates to snapshots. It was easier to code; I don't have to keep track of when to store what. The data are loaded into a carefully indexed PostgreSQL table using binary copy, so insert performance isn't an issue; we're seeing rates of at least a few thousand records/sec, which is sufficient.
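
The answer doesn't say which driver is used; as one possibility, here is a minimal sketch of a binary-copy load with psycopg 3, against the same hypothetical quotes table as above:

```python
import datetime
import psycopg  # psycopg 3 supports binary COPY natively

now = datetime.datetime.now(datetime.timezone.utc)
rows = [("AAPL", now, 182.50), ("MSFT", now, 411.20)]  # sample data

with psycopg.connect("dbname=market") as conn:
    with conn.cursor() as cur:
        with cur.copy(
            "COPY quotes (symbol, ts, price) FROM STDIN (FORMAT BINARY)"
        ) as copy:
            # Binary format requires explicit types for correct adaptation.
            copy.set_types(["text", "timestamptz", "float8"])
            for row in rows:
                copy.write_row(row)
```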

I am not specifically using round timestamps, but that would make it even easier to retrieve an entire snapshot, should we want to. At this point, we only retrieve data for one symbol at a time, at a single point in time.

Most of the symbols we deal with change much more than once every 10 minutes, so in any case our data set doesn't reflect the frequency of change in these symbols.

Update: We've started making more extensive use of the historical data. The ease with which we can now retrieve larger blocks of data for a single point in time is a real boon.
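
To illustrate, the two retrieval shapes mentioned here, sketched against the hypothetical quotes table from above:

```python
import datetime
import psycopg

at = datetime.datetime(2016, 1, 4, 10, 0, tzinfo=datetime.timezone.utc)

POINT_LOOKUP = """
    SELECT price FROM quotes
    WHERE symbol = %s AND ts <= %s
    ORDER BY ts DESC LIMIT 1
"""

# Latest row per symbol at (or before) a given time; with exactly
# round snapshot timestamps this collapses to WHERE ts = %s.
SNAPSHOT = """
    SELECT DISTINCT ON (symbol) symbol, ts, price
    FROM quotes
    WHERE ts <= %s
    ORDER BY symbol, ts DESC
"""

with psycopg.connect("dbname=market") as conn:
    price = conn.execute(POINT_LOOKUP, ("AAPL", at)).fetchone()
    snapshot = conn.execute(SNAPSHOT, (at,)).fetchall()
```

Both query shapes are served by the single (symbol, ts) multi-column index described in the question.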

Upvotes: 0
