Reputation: 107
I have multiple time series with non-aligning time values. Plotted below. Simple time-pressure data. How would I compute a difference between a pressure in df1 with that in df2 when the rows have slightly different timestamps and the frames have different shapes (df2 has many more rows)?
If the timestamps aligned between dataframes, and the frames had the same number of rows, I would expect this very simple intermediate column would suffice, but clearly in this case it won't. (Each dataset comes from a different device)
How do I generate a new pair of time-pressure columns having some sort of fixed/specific time interval and interpolated values from the original time-pressure columns?
Or is there a more sensible/easier way to achieve what I want to do (essentially, I just want to accurately quantify the differences between "208417" and the other series)?
Upvotes: 4
Views: 143
Reputation: 18217
The general idea I propose, is to use a neutral regular grid of evaluation points to calculate difference between irregular datasets. In order to evaluate datasets on this grid we use linear interpolation. The comments in the code go through the steps:
using Dates
using DataFrames
using Interpolations
# setup some random data
startdate = DateTime("2022-10-10", "yyyy-mm-dd")
df = DataFrame(
time = startdate .+ Second.(rand(1:500_000, 20)),
pressure = 10 .+ rand(20)
)
df2 = DataFrame(
time = startdate .+ Second.(rand(1:500_000, 20)),
pressure = 9 .+ rand(20)
)
# sort data by timestamp needed for interpolation
sort!(df, order(:time))
sort!(df2, order(:time))
# create numeric time value for interpolation
# TODO: suggestions for better method welcome!
time2val(x) = x.instant.periods.value
df.time2 = time2val.(df.time)/1000.0;
df2.time2 = time2val.(df2.time)/1000.0;
# create interpolation objects
itp = interpolate((df.time2,), df.pressure, Gridded(Linear()));
itp2 = interpolate((df2.time2,), df2.pressure, Gridded(Linear()));
# calculate a grid to evaluate difference on
N = 2*(nrow(df)+nrow(df2))
mintime = max(minimum(df2.time2), minimum(df.time2))
maxtime = min(maximum(df2.time2), maximum(df.time2))
interprange = range(mintime, maxtime, N)
# calculate average difference on interpolated grid
sum(itp[t]-itp2[t] for t in interprange)/N
The result is ~1.12
which stems from the data generation difference between 9
and 10
offsets on random pressure
s.
Upvotes: 3
Reputation: 106
when the rows have slightly different timestamps a closejoin must used first,e.g.
using InMemoryDatasets
df1=Dataset(Time=[Date("2022-09-28"),Date("2022-10-02"),Date("2022-10-03"),Date("2022-10-04")],Pressure=[10,11,9,15])
df2=Dataset(Time=[Date("2022-10-01"),Date("2022-10-02"),Date("2022-10-05")],Pressure=[7,9,19])
closejoin!(df1,df2,on=[:Time],makeunique=true,border=:missing)
modify!(df1,[:Pressure,:Pressure_1]=>byrow(-)=>:Difference1)
@df df1 plot(:Time,:Difference1,label="Difference1",color=:red)
Upvotes: 6