Timka

Reputation: 1769

Database Implementation Help : Time-Series data

This is a re-submission of my previous question.

I have a collection of ordered time-series data (minute-level stock price information). My current PostgreSQL database structure is below:

symbol_table - where I keep the list of symbols, with symbol_id as a serial primary key. time_table and date_table - time and date values are stored there, with time_id and date_id as serial primary keys.

My main table, minute_table, contains the minute pricing information, where (date_id, time_id, symbol_id) is the composite primary key (each column is also a foreign key to the corresponding table).

Using this main minute_table I perform various statistical analyses and keep the results in separate tables, such as one_minute_std, where one-minute standard deviation measures are kept.
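To make the structure concrete, here is a hypothetical reconstruction of the DDL described above; the column names beyond the stated keys (symbol, d, t, close, std) are assumptions:

```sql
-- Hypothetical DDL matching the schema described above.
CREATE TABLE symbol_table (
    symbol_id serial PRIMARY KEY,
    symbol    text NOT NULL UNIQUE
);

CREATE TABLE date_table (
    date_id serial PRIMARY KEY,
    d       date NOT NULL UNIQUE
);

CREATE TABLE time_table (
    time_id serial PRIMARY KEY,
    t       time NOT NULL UNIQUE
);

CREATE TABLE minute_table (
    date_id   integer NOT NULL REFERENCES date_table,
    time_id   integer NOT NULL REFERENCES time_table,
    symbol_id integer NOT NULL REFERENCES symbol_table,
    close     numeric,  -- assumed price column
    PRIMARY KEY (date_id, time_id, symbol_id)
);

CREATE TABLE one_minute_std (
    date_id   integer NOT NULL,
    time_id   integer NOT NULL,
    symbol_id integer NOT NULL,
    std       numeric,  -- assumed statistic column
    PRIMARY KEY (date_id, time_id, symbol_id)
);
```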

Every night I update the tables with the latest price information from that day's close.

With the current implementation my tables contain all the symbols, with around 50 million records each. The primary keys are indexed.

If I query for all the symbols where the closing price > x and one_minute_std > 2 and one_minute_std < 4 for a specific date, the search takes about 3-4 minutes.
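For concreteness, the query described would look something like this (the column names close and std, and the placeholders, are assumptions):

```sql
-- Hypothetical form of the slow query; :x and :target_date are placeholders.
SELECT s.symbol
FROM minute_table m
JOIN one_minute_std o USING (date_id, time_id, symbol_id)
JOIN symbol_table  s USING (symbol_id)
JOIN date_table    d USING (date_id)
WHERE d.d = :target_date
  AND m.close > :x
  AND o.std > 2
  AND o.std < 4;
```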

To speed up the process I was thinking of separating each symbol into its own table, but I'm not 100% sure that's the 'proper' way to do it.

Could you advise me on how I can speed up the query process?

Upvotes: 2

Views: 1541

Answers (1)

Gordon Linoff

Reputation: 1269793

It sounds like you want a combination of approaches.

First, you should look into table partitioning. This stores a single table across multiple storage units ("files") but still gives you the flexibility of a single table. (Here is the Postgres documentation: http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html.)

You would want to partition either by day or by ticker symbol. My first reaction would be by time (day/week/month), since that is the unit of updates. However, if your analyses are typically for a single ticker and often span multiple days, then there is an argument for partitioning by symbol instead.
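As a sketch of the by-day option: in PostgreSQL 10 and later, declarative range partitioning makes this straightforward. The example below assumes the surrogate date_id is replaced by a real date column for partitioning (partitioning on date_id ranges would work similarly); all names here are illustrative:

```sql
-- Parent table partitioned by trading date; a query constrained to one
-- date only scans the matching partition.
CREATE TABLE minute_table (
    trade_date date    NOT NULL,
    time_id    integer NOT NULL,
    symbol_id  integer NOT NULL,
    close      numeric,
    PRIMARY KEY (trade_date, time_id, symbol_id)
) PARTITION BY RANGE (trade_date);

-- One partition per month, for example.
CREATE TABLE minute_table_2015_01
    PARTITION OF minute_table
    FOR VALUES FROM ('2015-01-01') TO ('2015-02-01');
```

On older versions, the linked documentation shows the equivalent inheritance-based setup with CHECK constraints and constraint exclusion.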

After partitioning, you may want to consider indexes. However, I suspect that partitioning will solve your performance problems.

Since your updates happen at night, you should fold your summarization process in with the updates. For instance, one_minute_std should be calculated during this process. You might find it best to load the nightly data into a temporary table, do the calculations for summaries such as one_minute_std, and then load the data into the final partitioned tables.
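A rough sketch of that nightly flow follows; the file path, column names, and in particular the window used for the one-minute statistic are all assumptions, since the real definition isn't given:

```sql
-- 1. Stage the day's raw minute bars in a temporary table.
CREATE TEMPORARY TABLE staging_minute (LIKE minute_table);

COPY staging_minute (date_id, time_id, symbol_id, close)
FROM '/path/to/daily_prices.csv' WITH (FORMAT csv);  -- hypothetical path

-- 2. Compute summaries while the data is small and in one place.
--    The trailing window here is a placeholder for the real definition.
INSERT INTO one_minute_std (date_id, time_id, symbol_id, std)
SELECT date_id, time_id, symbol_id,
       stddev_samp(close) OVER (
           PARTITION BY symbol_id
           ORDER BY time_id
           ROWS BETWEEN 59 PRECEDING AND CURRENT ROW)
FROM staging_minute;

-- 3. Load the staged rows into the final (partitioned) table.
INSERT INTO minute_table
SELECT * FROM staging_minute;
```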

With so many rows and so few columns, you are probably better off with a good partitioning scheme than an indexing scheme. In particular, indexes have a space overhead, and the smaller each row is, the closer the cost of using the index comes to that of scanning the entire table.

Upvotes: 4
