natemcintosh
natemcintosh

Reputation: 840

asof join with Julia data tools

I am hoping to do something along the lines of pandas merge_asof or QuestDB's ASOF JOIN in Julia. Critically, I also need to apply a group-by operation.

I would be happy to use any of Julia's Table.jl respecting tools. DataFrame's leftjoin get's close, but requires exact key matches, and doesn't do grouping (as far as I can tell). SplitApplyCombine.jl's leftgroupjoin allows you to pass in your own comparison function, but I don't quite see how to use that function to specify the "nearest less than" value, or "nearest greater than" value.

For a simple example where group-bys are not necessary, on two tables left and right, each with a column time, I could use a function like

function find_nearest_before(val, data)
    findlast(x -> x <= val, data)
end

[find_nearest_before(t, right.time) for t in left.time]

and this would get me the indices in right to join to left. However, I don't quite see how to put this together with a group-by.


EDIT

Adding an example to make the question more clear. The first table sensor_pings reports when a sensor sees something. The second table in_sensor_FOV tells us what object is actually in a sensor's field of view (FOV) at a given time. Assume a sensor only has one object in its FOV at a time (opposite is not necessarily true).

julia> using TypedTables

julia> sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])
Table with 2 columns and 10 rows:
      time  sensor_id
    ┌────────────────
 1  │ 4     2
 2  │ 5     1
 3  │ 7     1
 4  │ 8     3
 5  │ 9     2
 6  │ 10    3
 7  │ 11    1
 8  │ 13    2
 9  │ 15    3
 10 │ 16    2

julia> in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])
Table with 3 columns and 8 rows:
     time  sensor_id  object_in_sensor_FOV
   ┌──────────────────────────────────────
 1 │ 1.3   3          a
 2 │ 2.6   1          b
 3 │ 3.8   2          c
 4 │ 5.9   3          b
 5 │ 7.3   2          c
 6 │ 8.0   2          a
 7 │ 12.3  3          c
 8 │ 14.7  1          b

The end result of the desired operation would look like

julia> Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2], object_in_sensor_FOV=[:c,:b,:b,:b,:a,:b,:b,:a,:c,:a])

Table with 3 columns and 10 rows:
      time  sensor_id  object_in_sensor_FOV
    ┌──────────────────────────────────────
 1  │ 4     2          c
 2  │ 5     1          b
 3  │ 7     1          b
 4  │ 8     3          b
 5  │ 9     2          a
 6  │ 10    3          b
 7  │ 11    1          b
 8  │ 13    2          a
 9  │ 15    3          c
 10 │ 16    2          a

Upvotes: 3

Views: 279

Answers (2)

Andrej Oskin
Andrej Oskin

Reputation: 2342

It's rather easy to write something like that, you just need to implement double cursor

using TypedTables
using Setfield

sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])

in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])

function mergeasof(t1, t2)
    objects = similar(t2.object_in_sensor_FOV, length(t1.time))
    d = ntuple(_ -> :z, 3) # :z is a sentinel value, means that there were no objects up to this moment. Can be anything
    i2 = 1
    # Double cursor
    for i1 in axes(t1, 1)
        tm1 = t1.time[i1]
        # updating `d` to the current time step 
        while i2 <= length(t2.time)
            t2.time[i2] > tm1 && break
            @set! d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
            i2 += 1
        end
        objects[i1] = d[t1.sensor_id[i1]]
    end

    return Table(time = t1.time, sensor_id = t1.sensor_id, object_in_sensor_FOV = objects)
end

julia> mergeasof(sensor_pings, in_sensor_FOV)
Table with 3 columns and 10 rows:
      time  sensor_id  object_in_sensor_FOV
    ┌──────────────────────────────────────
 1  │ 4     2          c
 2  │ 5     1          b
 3  │ 7     1          b
 4  │ 8     3          b
 5  │ 9     2          a
 6  │ 10    3          b
 7  │ 11    1          b
 8  │ 13    2          a
 9  │ 15    3          c
 10 │ 16    2          a

it should be rather fast and could be adapted for an arbitrary number of columns (it's just more tedious to right).

Few notes, though

  1. This function expects that tables are sorted over time
  2. It can be adapted to forward search, yet it can be more tedious.
  3. I am using the fact that there are 3 sensors. If the amount of sensors is known beforehand, then it can should be used in ntuple function. If it is unknown or large or indices are arbitrary, then instead of ntuple you can use Dict
d = Dict{Int, Symbol}()

and @set! should be removed

d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]

and instead of

objects[i1] = d[t1.sensor_id[i1]]

you should use

objects[i1] = get(d, t1.sensor_id[i1], :z)

Upvotes: 1

Nils Gudat
Nils Gudat

Reputation: 13800

Here's one way of doing it in DataFrames - this is certainly not the peak of efficiency, but if your data is small enough that you can afford the first leftjoin it might be good enough.

Start by joining in_sensor_FOV onto sensor_pings:

julia> df = leftjoin(sensor_pings, in_sensor_FOV, on = :sensor_id, makeunique = true);

after that you'll have multiple rows for each sensor in sensor_pings, which is where this approach might fail if your data is large.

Then get the time difference:

julia> transform!(df, [:time, :time_1] => ((x, y) -> x - y) => :time_diff);

Now your findlast approach iiuc means we only consider rows with positive time difference:

julia> df = df[df.time_diff .> 0.0, :];

Then we sort by sensor and time diff and pick the first row for each sensor:

julia> res = combine(groupby(sort(df, [:sensor_id, :time_diff]), [:sensor_id, :time]), names(df[:, Not([:sensor_id, :time])]) .=> first .=> names(df[:, Not([:sensor_id, :time])]));

Result (sorted to produce the same output):

julia> sort(select(res, [:time, :sensor_id, :object_in_sensor_FOV]), :time)
10×3 DataFrame
 Row │ time   sensor_id  object_in_sensor_FOV 
     │ Int64  Int64      Symbol               
─────┼────────────────────────────────────────
   1 │     4          2  c
   2 │     5          1  b
   3 │     7          1  b
   4 │     8          3  b
   5 │     9          2  a
   6 │    10          3  b
   7 │    11          1  b
   8 │    13          2  a
   9 │    15          3  c
  10 │    16          2  a

Upvotes: 1

Related Questions