Reputation: 840
I am hoping to do something along the lines of pandas merge_asof
or QuestDB's ASOF JOIN
in Julia. Critically, I also need to apply a group-by operation.
I would be happy to use any of Julia's Table.jl
respecting tools. DataFrame's leftjoin
get's close, but requires exact key matches, and doesn't do grouping (as far as I can tell). SplitApplyCombine.jl's leftgroupjoin
allows you to pass in your own comparison function, but I don't quite see how to use that function to specify the "nearest less than" value, or "nearest greater than" value.
For a simple example where group-bys are not necessary, on two tables left
and right
, each with a column time
, I could use a function like
function find_nearest_before(val, data)
findlast(x -> x <= val, data)
end
[find_nearest_before(t, right.time) for t in left.time]
and this would get me the indices in right
to join to left
. However, I don't quite see how to put this together with a group-by.
EDIT
Adding an example to make the question more clear. The first table sensor_pings
reports when a sensor sees something. The second table in_sensor_FOV
tells us what object is actually in a sensor's field of view (FOV) at a given time. Assume a sensor only has one object in its FOV at a time (opposite is not necessarily true).
julia> using TypedTables
julia> sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])
Table with 2 columns and 10 rows:
time sensor_id
┌────────────────
1 │ 4 2
2 │ 5 1
3 │ 7 1
4 │ 8 3
5 │ 9 2
6 │ 10 3
7 │ 11 1
8 │ 13 2
9 │ 15 3
10 │ 16 2
julia> in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])
Table with 3 columns and 8 rows:
time sensor_id object_in_sensor_FOV
┌──────────────────────────────────────
1 │ 1.3 3 a
2 │ 2.6 1 b
3 │ 3.8 2 c
4 │ 5.9 3 b
5 │ 7.3 2 c
6 │ 8.0 2 a
7 │ 12.3 3 c
8 │ 14.7 1 b
The end result of the desired operation would look like
julia> Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2], object_in_sensor_FOV=[:c,:b,:b,:b,:a,:b,:b,:a,:c,:a])
Table with 3 columns and 10 rows:
time sensor_id object_in_sensor_FOV
┌──────────────────────────────────────
1 │ 4 2 c
2 │ 5 1 b
3 │ 7 1 b
4 │ 8 3 b
5 │ 9 2 a
6 │ 10 3 b
7 │ 11 1 b
8 │ 13 2 a
9 │ 15 3 c
10 │ 16 2 a
Upvotes: 3
Views: 279
Reputation: 2342
It's rather easy to write something like that, you just need to implement double cursor
using TypedTables
using Setfield
sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])
in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])
function mergeasof(t1, t2)
objects = similar(t2.object_in_sensor_FOV, length(t1.time))
d = ntuple(_ -> :z, 3) # :z is a sentinel value, means that there were no objects up to this moment. Can be anything
i2 = 1
# Double cursor
for i1 in axes(t1, 1)
tm1 = t1.time[i1]
# updating `d` to the current time step
while i2 <= length(t2.time)
t2.time[i2] > tm1 && break
@set! d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
i2 += 1
end
objects[i1] = d[t1.sensor_id[i1]]
end
return Table(time = t1.time, sensor_id = t1.sensor_id, object_in_sensor_FOV = objects)
end
julia> mergeasof(sensor_pings, in_sensor_FOV)
Table with 3 columns and 10 rows:
time sensor_id object_in_sensor_FOV
┌──────────────────────────────────────
1 │ 4 2 c
2 │ 5 1 b
3 │ 7 1 b
4 │ 8 3 b
5 │ 9 2 a
6 │ 10 3 b
7 │ 11 1 b
8 │ 13 2 a
9 │ 15 3 c
10 │ 16 2 a
it should be rather fast and could be adapted for an arbitrary number of columns (it's just more tedious to right).
Few notes, though
time
ntuple
function. If it is unknown or large or indices are arbitrary, then instead of ntuple you can use Dictd = Dict{Int, Symbol}()
and @set!
should be removed
d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
and instead of
objects[i1] = d[t1.sensor_id[i1]]
you should use
objects[i1] = get(d, t1.sensor_id[i1], :z)
Upvotes: 1
Reputation: 13800
Here's one way of doing it in DataFrames - this is certainly not the peak of efficiency, but if your data is small enough that you can afford the first leftjoin
it might be good enough.
Start by joining in_sensor_FOV
onto sensor_pings
:
julia> df = leftjoin(sensor_pings, in_sensor_FOV, on = :sensor_id, makeunique = true);
after that you'll have multiple rows for each sensor in sensor_pings
, which is where this approach might fail if your data is large.
Then get the time difference:
julia> transform!(df, [:time, :time_1] => ((x, y) -> x - y) => :time_diff);
Now your findlast
approach iiuc means we only consider rows with positive time difference:
julia> df = df[df.time_diff .> 0.0, :];
Then we sort by sensor and time diff and pick the first row for each sensor:
julia> res = combine(groupby(sort(df, [:sensor_id, :time_diff]), [:sensor_id, :time]), names(df[:, Not([:sensor_id, :time])]) .=> first .=> names(df[:, Not([:sensor_id, :time])]));
Result (sorted to produce the same output):
julia> sort(select(res, [:time, :sensor_id, :object_in_sensor_FOV]), :time)
10×3 DataFrame
Row │ time sensor_id object_in_sensor_FOV
│ Int64 Int64 Symbol
─────┼────────────────────────────────────────
1 │ 4 2 c
2 │ 5 1 b
3 │ 7 1 b
4 │ 8 3 b
5 │ 9 2 a
6 │ 10 3 b
7 │ 11 1 b
8 │ 13 2 a
9 │ 15 3 c
10 │ 16 2 a
Upvotes: 1