asof join with Julia data tools

Question

I am hoping to do something along the lines of pandas merge_asof or QuestDB's ASOF JOIN in Julia. Critically, I also need to apply a group-by operation.

I would be happy to use any of Julia's Table.jl respecting tools. DataFrame's leftjoin get's close, but requires exact key matches, and doesn't do grouping (as far as I can tell). SplitApplyCombine.jl's leftgroupjoin allows you to pass in your own comparison function, but I don't quite see how to use that function to specify the "nearest less than" value, or "nearest greater than" value.

For a simple example where group-bys are not necessary, on two tables left and right, each with a column time, I could use a function like

function find_nearest_before(val, data)
    findlast(x -> x <= val, data)
end

[find_nearest_before(t, right.time) for t in left.time]

and this would get me the indices in right to join to left. However, I don't quite see how to put this together with a group-by.

EDIT

Adding an example to make the question more clear. The first table sensor_pings reports when a sensor sees something. The second table in_sensor_FOV tells us what object is actually in a sensor's field of view (FOV) at a given time. Assume a sensor only has one object in its FOV at a time (opposite is not necessarily true).

julia> using TypedTables

julia> sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])
Table with 2 columns and 10 rows:
      time  sensor_id
    ┌────────────────
 1  │ 4     2
 2  │ 5     1
 3  │ 7     1
 4  │ 8     3
 5  │ 9     2
 6  │ 10    3
 7  │ 11    1
 8  │ 13    2
 9  │ 15    3
 10 │ 16    2

julia> in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])
Table with 3 columns and 8 rows:
     time  sensor_id  object_in_sensor_FOV
   ┌──────────────────────────────────────
 1 │ 1.3   3          a
 2 │ 2.6   1          b
 3 │ 3.8   2          c
 4 │ 5.9   3          b
 5 │ 7.3   2          c
 6 │ 8.0   2          a
 7 │ 12.3  3          c
 8 │ 14.7  1          b

The end result of the desired operation would look like

julia> Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2], object_in_sensor_FOV=[:c,:b,:b,:b,:a,:b,:b,:a,:c,:a])

Table with 3 columns and 10 rows:
      time  sensor_id  object_in_sensor_FOV
    ┌──────────────────────────────────────
 1  │ 4     2          c
 2  │ 5     1          b
 3  │ 7     1          b
 4  │ 8     3          b
 5  │ 9     2          a
 6  │ 10    3          b
 7  │ 11    1          b
 8  │ 13    2          a
 9  │ 15    3          c
 10 │ 16    2          a

Andrej Oskin · Accepted Answer

It's rather easy to write something like that, you just need to implement double cursor

using TypedTables
using Setfield

sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])

in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])

function mergeasof(t1, t2)
    objects = similar(t2.object_in_sensor_FOV, length(t1.time))
    d = ntuple(_ -> :z, 3) # :z is a sentinel value, means that there were no objects up to this moment. Can be anything
    i2 = 1
    # Double cursor
    for i1 in axes(t1, 1)
        tm1 = t1.time[i1]
        # updating `d` to the current time step 
        while i2 <= length(t2.time)
            t2.time[i2] > tm1 && break
            @set! d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
            i2 += 1
        end
        objects[i1] = d[t1.sensor_id[i1]]
    end

    return Table(time = t1.time, sensor_id = t1.sensor_id, object_in_sensor_FOV = objects)
end

julia> mergeasof(sensor_pings, in_sensor_FOV)
Table with 3 columns and 10 rows:
      time  sensor_id  object_in_sensor_FOV
    ┌──────────────────────────────────────
 1  │ 4     2          c
 2  │ 5     1          b
 3  │ 7     1          b
 4  │ 8     3          b
 5  │ 9     2          a
 6  │ 10    3          b
 7  │ 11    1          b
 8  │ 13    2          a
 9  │ 15    3          c
 10 │ 16    2          a

it should be rather fast and could be adapted for an arbitrary number of columns (it's just more tedious to right).

Few notes, though

This function expects that tables are sorted over time
It can be adapted to forward search, yet it can be more tedious.
I am using the fact that there are 3 sensors. If the amount of sensors is known beforehand, then it can should be used in ntuple function. If it is unknown or large or indices are arbitrary, then instead of ntuple you can use Dict

d = Dict{Int, Symbol}()

and @set! should be removed

d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]

and instead of

objects[i1] = d[t1.sensor_id[i1]]

you should use

objects[i1] = get(d, t1.sensor_id[i1], :z)

asof join with Julia data tools

Answers (2)

Related Questions