ks-man
ks-man

Reputation: 167

KDB - Historical Table - what's the "right" way to doing it?

I have a point of sale system that exports my data to a .csv which I then import into KDB. Currently what I do is export all the data from the POS to the csv and then create a single table. I have about 10 months of sales data and my csv file is about 11mb. As time grows I imagine that the csv file will be really huge and I wonder if this is inefficient.

At my old job what we would do is have a table for each individual day of data and then there would be an _hist table which would combine all the daily files. So if I wanted to just look at the data for the day I would look at the invoicedata table and if I wanted to look across all time I would look at the invoidata_hist table and set the query to look at date within (dateA;dateB). I'm wondering if I should set things up this way rather than the way I'm doing it now.

Am I better off having one very large csv file with all the data in it or should I create a csv file for each individual day? If the second way is better can anybody give me an idea of the best way to set this all up?

Thanks!

Upvotes: 2

Views: 1077

Answers (1)

nyi
nyi

Reputation: 3229

If your total number of records are not going to exceed a few million then partitioning might be an overkill.

I'll think of partitioning the data if my daily table count is around 1 million or more.

You also need to consider how you are going to access the data, e.g. checking last-n records of an in-frequent customer in date partitioned table might impact your query performance as you have to iteratively lookback. In that case, a splayed or yearly partitioned table might be the right fit.

Having said that there are multiple ways you can store the data on disk, check out this link.

  • binary serialized (stores the table as a binary block)

    `:/db/t set ([] ti:09:30:00 09:31:00; p:101.5 33.5)

  • splayed table (vertical/column partitioned )

    `:/db/t/ set ([] ti:09:30:00 09:31:00; p:101.5 33.5) // trailing "/" in the file handle

  • partitioned table ( horizontally partitioned splayed table)

    .Q.dpft[directory;partition;`p#field;tablename]

  • segmented tables ( advance partition for better query/map-reduce results)

    .Q.dpft[directory;partition;`p#field;tablename]

  • save function - to save data as binary/xml/csv/txt/xml format.

Since you have asked for a date partitioned table in the question, actually there are different ways you can partition your data :

  • daily
  • monthly
  • yearly
  • long (customizable on any long column )

You might want to store the data in a monthly partition based on the table count.

For saving the data to a partition, you can use .Q.dpft function

.Q.dpft[directory;partition;`p#field;tablename] 

example from code.kx:

q)trade:([]sym:10?`a`b`c;time:.z.T+10*til 10;price:50f+10?50f;size:100*1+10?10)
q).Q.dpft[`:db;2007.07.23;`sym;`trade]
`trade

Upvotes: 2

Related Questions