Reputation: 4716
What would be the optimal partitioning (if any) of a kdb+ database storing daily price/return data on 10,000 symbols (e.g. stocks, indices, ETFs)?

We are talking about 25 years of daily data (median) per symbol (25 * 200 = 5,000 records), so the overall size will be 10,000 symbols x 5,000 days = 50,000,000 records.
The database will write a new price for each symbol at the end of the day.
The most typical query will read the whole daily price history for a subset (tens, even hundreds) of symbols into memory for further time-series analysis/portfolio simulation.
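For concreteness, here is a minimal sketch of that query in q, assuming a table named daily with columns date, sym and price (all names are placeholders, not an actual schema):

    / hypothetical subset of symbols to load for the analysis
    syms: `AAPL`MSFT`SPY;

    / pull the full daily history for just those symbols
    history: select date, sym, price from daily where sym in syms;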
I was thinking about partitioning by symbol, but I haven't found anyone on the web doing it that way, so I'm not sure it is the best idea.
I've found plenty of solutions for higher-than-daily frequency where they partition tick data by day (each date has its own partition, e.g. 2015-02-12), or partition by symbol range (e.g. ABC DEF GHI ...), but not by individual symbols.
In the backtesting/portfolio simulation application[1], I suspect that partitioning by individual symbols may have a later advantage: price transformations (such as a moving average or RSI) could be calculated in parallel by giving each worker node its own symbol to process (see the sketch after footnote [1] below). By contrast, partitioning by day/week/year would not provide that advantage.
[1] The application essentially first loops over each symbol and pre-processes its time series into a transformed time series to generate signals (e.g. a moving average of price), then loops day by day in a path-dependent fashion (the portfolio is a stateful object) to check and act upon the signals each day.
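For illustration, a minimal sketch of that per-symbol pre-processing step in q, using peach to spread symbols across secondary threads (which assumes the process was started with -s); the 20-day moving average, and the history table from the sketch above, are placeholders:

    / per-symbol transform: 20-day moving average as an example signal
    ma20: {[s] select date, sym, signal:mavg[20;price] from history where sym=s};

    / run one symbol per worker and join the per-symbol results
    allsyms: exec distinct sym from history;
    signals: raze ma20 peach allsyms;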
Upvotes: 0
Views: 611
Reputation: 13657
As @GilbertLeBlanc says, 50m rows is certainly not much. If that is the expected size, it would simply be best (in this use case) to splay the table (i.e. no partitioning at all) and apply attributes to the columns you will filter on (p# or g# on ticker, since you'll be filtering on that, taking care to maintain those attributes whenever new data is upserted). But obviously the splayed-table approach won't scale well if you plan to grow it much beyond 100m rows.
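As a rough sketch of that setup (the db root :/db, the table name daily and the column names are placeholders, and this assumes the table fits in memory for the initial sort and splay):

    / sort by sym so the parted attribute is valid, then enumerate and splay
    daily: `sym xasc daily;
    `:/db/daily/ set .Q.en[`:/db; daily];

    / apply the parted attribute to the on-disk sym column
    @[`:/db/daily; `sym; `p#];

With g# instead of p# the initial sort is not required, at the cost of a larger on-disk index for the column.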
Date partitioning would be a bad idea if you primarily plan to search for all historic data for a particular sym, since the query would have to trawl through every date partition, i.e. many, many disk reads.
Upvotes: 1