nashar1

Reputation: 1087

historic data storage and retrieval

I am using a standard splayed format for my trade data: a directory per date, with each column as a separate file inside. I am reading from CSV files and storing using the code below. I am using the 32-bit trial version on 64-bit Windows 7.

readDat: {[dt]
  / load the day's trades from CSV (path and column types are illustrative)
  trd: ("STJ"; enlist ",") 0: hsym `$"/kdb/csv/",string dt;
  trd: `sym`time`trdId xasc trd;
  / trd: update `g#sym from trd;   / optional: grouped attribute on sym
  trade:: trd;
  .Q.dpft[`:/kdb/ndb; dt; `sym; `trade];
  .Q.gc[];
 };

\t readDat each 50#dtlist

I have tried both with `g#sym and without it. The data typically has 1.5MM rows per date, and a select for one day takes 0.5 to 1 second. Is there a way to improve times for either of the queries below?

\t select from trade where date=x
\t select from trade where date=x, sym=y

I have read the docs on segmentation, partitioning, etc., but I'm not sure whether anything there would help here.

On second thought, would creating a table for each sym speed things up? I am trying that out, but wanted to know if there are memory/space tradeoffs I should be aware of.

Upvotes: 2

Views: 1133

Answers (3)

moepud

Reputation: 461

Have you done any profiling to see what the actual bottleneck is? If you find that the problem is disk read speed (check with something like iostat), you can either get a faster disk (SSD), add more memory (for a bigger disk cache), or use par.txt to shard your database across multiple disks, so that the query runs on multiple disks and cores in parallel; a sketch follows.
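
A minimal par.txt sketch for the sharding idea above (paths are illustrative). par.txt sits in the database root and names one segment directory per line, ideally each on a separate physical disk; date partitions are then written under the segment roots rather than the main root:

/ contents of /kdb/ndb/par.txt
/kdb/disk1/ndb
/kdb/disk2/ndb

Starting the process with secondary threads, e.g. q /kdb/ndb -s 4, lets the per-partition work run in parallel.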

Upvotes: 1

Rob

Reputation: 823

As user1895961 mentioned, selecting only certain columns will be faster. KDB splayed/partitioned tables are almost exactly just files on the filesystem: the smaller the files and the fewer you have to read, the faster it will be. The balance between the number of folders and the number of files is key. 1.5MM rows per partition is OK, but on the large side; perhaps you might want to partition by something else.

You may also want to normalise your data, splitting it into multiple tables and using linked columns to join it back together on the fly, as in the sketch below. Linked columns, if set up correctly, can be very powerful and can help avoid reading too much data back from disk when filtering is added.
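
A minimal in-memory sketch of a link column (table and column names are hypothetical):

instr:([] sym:`AAPL`IBM`MSFT; exch:`Q`N`Q);         / normalised reference table
trade:([] sym:`AAPL`IBM`AAPL; price:100 200 101f);
/ link column: row indices into instr, tagged with the target table's name
update instrLink:`instr!instr[`sym]?sym from `trade;
select price, instrLink.exch from trade             / resolved on the fly

On disk the same idea applies: the link column is just another small file, and the reference table is only read when a query actually follows the link.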

Also try converting your data to char instead of sym; I found big performance increases from doing so (see the one-liner below).
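
A minimal sketch, assuming a high-cardinality identifier column you do not want enumerated into the sym file (column name hypothetical):

update orderRef:string orderRef from `trade;  / sym -> nested char vectors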

Upvotes: 0

user1895961

Reputation: 1186

As you are using .Q.dpft, you are already partitioning your DB. If your use case is always to pass one date in your queries, then segmenting by date will not provide any performance improvement. You could possibly segment by symbol range (see here), although that is not something I've ever tried.

One basic way to improve performance would be to select only a subset of the columns, as below. Do you really need to read all of the fields when querying? Depending on the width of your table this can have a large impact, as some files can now be ignored completely.
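
For instance (column names are illustrative), this touches only the time and price column files instead of every file in the partition:

\t select time, price from trade where date=x, sym=y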

Another way to improve performance would be to apply `u# to the sym file; a sketch follows. This will speed up your second query, as the lookup in the sym file will be faster, although how much depends on the size of your universe. I would imagine the benefit would be marginal in comparison to reducing the number of columns requested.
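
A hedged sketch, assuming the enumeration file lives at the database root (path illustrative; the attribute on a simple symbol list is persisted by set):

`:/kdb/ndb/sym set `u#get `:/kdb/ndb/sym  / unique attribute speeds sym lookups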

Upvotes: 0
