Gabi
Gabi

Reputation: 453

Tick data in kdb partitioned-by-date table: should I use a time or datetime column?

What is the standard canonical temporal column for a tick-level dataset (trades or quotes) in q/kdb+ when using a table partitioned by date? Should I have a time or a datetime column in addition to the partition? I'm seeing the following pros and cons, but maybe I'm missing something:

Timestamp columns: Con: I'm storing redundant data, both on disk and when displayed; so a select from table where date=max date effectively shows me the date twice; but Pro: it's much easier to interface with Python and I can more easily re-partition the data if I wanted to

Time columns: Con: Messier to interface with Python; Pro: I don't see the date twice when doing a select

I am aware of the sizes in bytes of the individual datatypes, so I understand I may not be saving space if for example I would be replacing a timespan (`n) with a timestamp (`p), but I can't for the life of me ever remember doing a select (I used to work in a bank with q as well) and ever seeing the date twice!

Upvotes: 1

Views: 339

Answers (2)

Matt Moore
Matt Moore

Reputation: 2800

It really depends on the data set. If you previously worked in a bank—it may have been in equities—which has defined market open hours by region/exchange and therefore a timestamp is unnecessary alongside the date. Other assets, such as FX, are 24/7 so the timestamp is more important. As you say, a timespan will not save space, but if you care about precision beyond milliseconds it may be necessary, e.g., time is half the bytes at the cost of precision.

It is worth noting that the date column is virtual, based on partition, so it does not cost anything to store on disk as it's just the directory names. This will of course cost memory when loaded in and sent to your Python application (I am not sure what the overhead is on the q side, but it will cost more for your Python application with transfer, etc.). If it is easier to work with a timestamp in Python then why not just keep it?

If you care about the date appearing twice, you could lazily delete it after select:

delete date from select from t

Or ignore it in a select based on columns

select col1, col2 from t

Upvotes: 3

terrylynch
terrylynch

Reputation: 13612

One of the most important considerations for the time column in on-disk tick data is whether or not it can be used in an as-of join aj. For this you want the time column to be sorted within each ticker/sym (assuming you have p# on ticker/sym column, which is recommended for tick data), and so long as this time column has an underlying long/integer type (which both timestamp and timespan do, datetime doesn't) then an aj will be optimal.

So the fact that you "see" the date again isn't a big concern. One other thing to be aware of: if you do go with timestamps, it is possible that your timestamps can span more than one date within a given date slice? i.e. could you have records from the day before? If so then something like this:

select from table where date=2022.02.01, time>23:00

could theoretically return the records which are >23:00 from both dates.

Upvotes: 4

Related Questions