Will
Will

Reputation: 942

Filling missing data in a table

I have a problem with missing data. So I have a returns table created this way:

returns:update logret:log ret from update ret:{0.0, 1_deltas x} mid by sym from spots;

and looking like this:

meta returns
c     | t f a
------| -----
date  | d    
time  | p    
sym   | s    
mid   | f    
ret   | f    
logret| f    

and I pivot it this way:

rettbl:0!exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns;

so that works on a small range of time, but when I extend it fails with a "type: Mismatched types"

I believe the problem is that I don't have the same observations for all sym in my returns table, as
select count time by sym from returns
gives me a different count by sym.

So my question is: how would you go about filling any missing mid for all times I have in my table returns? To make it quick and dirty i was thinking about having a linear interpolation for missing points between two `time ...
Any hint, link, etc... would be greatly appreciated.

EDIT:
I can't share the data, but this should allow to reproduce the issue:

mid1: 1.2 + ({rand 1.0} each til 10) %100
mid2: 0.8 + ({rand 1.0} each til 10) %100
mid3: 104 + ({rand 1.0} each til 10) %100
sym1:{`$"EUR/USD"} each til 10
sym2:{`$"GBP/USD"} each til 10
sym3:{`$"USD/JPY"} each til 10
time1:2020.07.01D00:00:00.000000000 + 1D00:00:00 * til 10
/time2:2020.07.01D00:00:00.000000000 + 1D00:00:00 * til 10
time2:(2020.07.01D00:00:00.000000000 + 1D00:00:00 * til 3) , (2020.07.06D00:00:00.000000000 + 1D00:00:00 * til 3) , (2020.07.10D00:00:00.000000000 + 1D00:00:00 * til 4)
time3:2020.07.01D00:00:00.000000000 + 1D00:00:00 * til 10

spots:([] sym:sym1,sym2,sym3; time:time1,time2,time3; mid:mid1,mid2,mid3)
spots:update date:"d"$time from spots
returns:update logret:log ret from update ret:{0.0, 1_deltas x} mid by sym from spots;
rettbl:0!exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns;

do note the time2 definition.
as per the desired output, if you replace the time2 definition with the one I commented out you will see.
as per the solution, I suspect that adding to the table returns the missing time, so that the list of time is the same for all sym are identical will solve the issue, ie value select time by sym from returns should be identical, and a quick a dirty work around would be to add missing mid with a linear interpolation of the surrounding mid.

Upvotes: 0

Views: 697

Answers (1)

Matt Moore
Matt Moore

Reputation: 2800

The P# is pretty crucial here in order to produce the pivot properly. This will place nulls where time is missing for a sym col. https://code.kx.com/q/kb/pivoting-tables/

{

    P:exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret)) from returns;
    exec P#(raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns

    }[]

Output:
time                         | EUR/USD_mid GBP/USD_mid USD/JPY_mid EUR/USD_re..
-----------------------------| ----------------------------------------------..
2020.07.01D00:00:00.000000000| 1.203915    0.8078337   104.0062    0         ..
2020.07.02D00:00:00.000000000| 1.200812    0.8040996   104.0093    -0.0031030..
2020.07.03D00:00:00.000000000| 1.209368    0.8061088   104.0027    0.00855514..
2020.07.04D00:00:00.000000000| 1.202782                104.0006    -0.0065853..
2020.07.05D00:00:00.000000000| 1.202392                104.0026    -0.0003897..
2020.07.06D00:00:00.000000000| 1.201508    0.8049765   104.0023    -0.0008842..
2020.07.07D00:00:00.000000000| 1.201567    0.8040875   104.0009    5.91839e-0..
2020.07.08D00:00:00.000000000| 1.209785    0.8044973   104.001     0.00821768..
2020.07.09D00:00:00.000000000| 1.207043                104.0087    -0.0027416..
2020.07.10D00:00:00.000000000| 1.209442    0.8001392   104.0073    0.00239835..
2020.07.11D00:00:00.000000000|             0.8071488                         ..
2020.07.12D00:00:00.000000000|             0.8019465                         ..

You are getting a type error because the code is creating a dictionary when time is not present for all syms.

q)0!exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns
'type
  [0]  0!exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns
        ^
q)exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns
time                         |                                               ..
-----------------------------| ----------------------------------------------..
2020.07.01D00:00:00.000000000| `EUR/USD_mid`GBP/USD_mid`USD/JPY_mid`EUR/USD_r..
2020.07.02D00:00:00.000000000| `EUR/USD_mid`GBP/USD_mid`USD/JPY_mid`EUR/USD_r..
2020.07.03D00:00:00.000000000| `EUR/USD_mid`GBP/USD_mid`USD/JPY_mid`EUR/USD_r..
2020.07.04D00:00:00.000000000| `EUR/USD_mid`USD/JPY_mid`EUR/USD_ret`USD/JPY_r..
2020.07.05D00:00:00.000000000| `EUR/USD_mid`USD/JPY_mid`EUR/USD_ret`USD/JPY_r..

Edit: To remove nulls you could use fills to fill forward non-null values. You also may want to use .Q.id to sanitize the table first. Columns such as EUR/USD_mid are tricky to work with due to the confusion with over /.

https://code.kx.com/q/ref/dotq/#qid-sanitize

q)fills x
time                         | EURUSD_mid GBPUSD_mid USDJPY_mid EURUSD_ret   ..
-----------------------------| ----------------------------------------------..
2020.07.01D00:00:00.000000000| 1.203928   0.8049318  104.0047   0            ..
2020.07.02D00:00:00.000000000| 1.205171   0.8057852  104.0063   0.001243387  ..
2020.07.03D00:00:00.000000000| 1.20516    0.8008389  104.0097   -1.11147e-005..
2020.07.04D00:00:00.000000000| 1.204067   0.8008389  104.0023   -0.001093155 ..
2020.07.05D00:00:00.000000000| 1.201781   0.8008389  104.0095   -0.002285803 ..

This will remove most nulls except for any that exist in the first row. I don't have any experience writing something such as linear interpolation in kdb and I don't think it will replace nulls as well as fills. For example, the example data set has 3 nulls at the end in the EURUSD_mid column so can't be interpolated for those 3 entries.

Upvotes: 3

Related Questions