Reputation: 19466
I have five years of minute bars for around 10,000 symbols, as CSV files. It totals around 50GB of text. My RAM is 32GB.
I'm trying to load all this data into a KDB table, so it's easily queryable.
symbols: `$(-4_')(string') key `:/home/chris/sync/us_equities
f: {`$(":/home/chris/sync/us_equities/", x, ".csv")}
load_symbol: {(0!(update "P"$-1_/:t, s: x from flip `t`o`h`l`c`v!("*FFFFI";",")0: f(string x))) }
({`:/home/chris/sync/price_data insert (load_symbol x)}) each symbols
insert is painfully slow. It looks like it would take around a day to process, perhaps longer. How might I optimize it? I tried peach, but this was even slower. It looks like it starts out very fast, and gets slower with each step of the each. Thank you!
Upvotes: 0
Views: 1013
Reputation: 1780
Using a flat file in this case is not recommended due to the data size and the frequency of updates. The whole file has to be rewritten from scratch on every insert, so insert time grows linearly with the total row count.
q)t:([]a:til 10000000;b:til 10000000)
q)`:t set t
`:t
q)\t `:t insert t
305
q)\t `:t insert t
365
q)\t `:t insert t
574
q)\t `:t insert t
809
q)\t `:t insert t
1236
q)\t `:t insert t
2687
q)\t `:t insert t
3200
Compare this to a splayed table, where each column of the new data is appended to the corresponding column file, giving roughly constant-time inserts.
q)t:([]a:til 10000000;b:til 10000000)
q)`:t/ set t
`:t/
q)\t `:t insert t
166
q)\t `:t insert t
101
q)\t `:t insert t
97
q)\t `:t insert t
100
q)\t `:t insert t
111
q)\t `:t insert t
113
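Applied to the loop in the question, that means writing with upsert to a splayed directory instead of inserting into a flat file. A minimal sketch, reusing load_symbol from the question and assuming the parent directory as the enumeration root (both paths are illustrative):
/ symbol columns must be enumerated (.Q.en) before writing splayed
root: `:/home/chris/sync                                     / assumed enumeration root
save_symbol: {`:/home/chris/sync/price_data/ upsert .Q.en[root] load_symbol x}
save_symbol each symbols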
Since the symbol is not already in the file, adding it as a column to the table is wise. However, I would suggest naming the column sym instead of s, purely because that is conventional within kdb+ and some built-in functions assume this name.
From my estimates, this table will be far too large for a simple splayed table. I would partition it by either date or month, depending on the types of queries you are running.
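For a date-partitioned layout, .Q.dpft writes an in-memory table as one enumerated partition, ordered on a chosen column with the parted attribute applied. A rough sketch, where the root path, table name and column names are purely illustrative:
/ one day of bars held in memory, then written as the 2020.01.02 partition
bars: ([] sym:`AAPL`MSFT; time:2020.01.02D09:30:00 2020.01.02D09:30:00; o:1 2f; h:1 2f; l:1 2f; c:1 2f; v:100 200i)
.Q.dpft[`:/home/chris/sync/hdb; 2020.01.02; `sym; `bars]     / enumerates syms, orders on sym, sets `p#sym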
Sorting by sym and adding a parted attribute is a must if your queries will often select a subset of the syms. Sorting by time bucket within each sym is required to use an asof join, as it uses binary searches.
The following code will do this, but as your files are already separated by sym, you should be able to skip the sym sort.
/ to sort a table in memory and apply parted attribute
update `p#sym from `sym`time xasc data
/ to sort a table on disk and apply parted attribute
`sym`time xasc `:path/to/partition
@[`:path/to/partition;`sym;`p#]
If your queries are more geared towards selecting a specific time window across all symbols, you may be better off sorting only by time bucket and applying a sorted attribute to that column.
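A sketch of that alternative, following the same pattern as above (the path and time column name are illustrative):
/ to sort a table on disk by time only and apply the sorted attribute
`time xasc `:path/to/partition
@[`:path/to/partition;`time;`s#]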
Additionally, you may want to consider streaming the CSV files using .Q.fs or .Q.fsn to reduce the memory usage of any single load. This would also allow you to use multi-threading or additional processes to load the data while keeping the same memory overhead.
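For example, a chunked loader built around .Q.fs might look like the sketch below. The parse mirrors the one in the question (so it assumes no header row), and the paths and enumeration root are assumptions:
/ parse each chunk of lines and append it to the splayed table
root: `:/home/chris/sync
loadchunk: {[s;lines]
  tab: flip `t`o`h`l`c`v!("*FFFFI";",") 0: lines;            / same column types as the question
  tab: update t:"P"$-1_/:t, sym:s from tab;                  / parse timestamps, tag the symbol
  `:/home/chris/sync/price_data/ upsert .Q.en[root] tab }
.Q.fs[loadchunk[`AAPL]; `:/home/chris/sync/us_equities/AAPL.csv]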
Upvotes: 2