byezy
byezy

Reputation: 107

Calculate differences between different times series with non-aligning times using Julia and Dataframes?

I have multiple time series with non-aligning time values. Plotted below. Simple time-pressure data. How would I compute a difference between a pressure in df1 with that in df2 when the rows have slightly different timestamps and the frames have different shapes (df2 has many more rows)?

simple data plot

If the timestamps aligned between dataframes, and the frames had the same number of rows, I would expect this very simple intermediate column would suffice, but clearly in this case it won't. (Each dataset comes from a different device)

wishfull thinking

How do I generate a new pair of time-pressure columns having some sort of fixed/specific time interval and interpolated values from the original time-pressure columns?

Or is there a more sensible/easier way to achieve what I want to do (essentially, I just want to accurately quantify the differences between "208417" and the other series)?

Upvotes: 4

Views: 143

Answers (2)

Dan Getz
Dan Getz

Reputation: 18217

The general idea I propose, is to use a neutral regular grid of evaluation points to calculate difference between irregular datasets. In order to evaluate datasets on this grid we use linear interpolation. The comments in the code go through the steps:

using Dates
using DataFrames
using Interpolations

# setup some random data
startdate = DateTime("2022-10-10", "yyyy-mm-dd")
df = DataFrame(
  time = startdate .+ Second.(rand(1:500_000, 20)), 
  pressure = 10 .+ rand(20)
)
df2 = DataFrame(
  time = startdate .+ Second.(rand(1:500_000, 20)), 
  pressure = 9 .+ rand(20)
)
# sort data by timestamp needed for interpolation
sort!(df, order(:time))
sort!(df2, order(:time))
# create numeric time value for interpolation
# TODO: suggestions for better method welcome!
time2val(x) = x.instant.periods.value
df.time2 = time2val.(df.time)/1000.0;
df2.time2 = time2val.(df2.time)/1000.0;

# create interpolation objects
itp = interpolate((df.time2,), df.pressure, Gridded(Linear()));
itp2 = interpolate((df2.time2,), df2.pressure, Gridded(Linear()));

# calculate a grid to evaluate difference on
N = 2*(nrow(df)+nrow(df2))
mintime = max(minimum(df2.time2), minimum(df.time2))
maxtime = min(maximum(df2.time2), maximum(df.time2))
interprange = range(mintime, maxtime, N)

# calculate average difference on interpolated grid
sum(itp[t]-itp2[t] for t in interprange)/N

The result is ~1.12 which stems from the data generation difference between 9 and 10 offsets on random pressures.

Upvotes: 3

Julia_lover
Julia_lover

Reputation: 106

when the rows have slightly different timestamps a closejoin must used first,e.g.

using InMemoryDatasets

df1=Dataset(Time=[Date("2022-09-28"),Date("2022-10-02"),Date("2022-10-03"),Date("2022-10-04")],Pressure=[10,11,9,15])

df2=Dataset(Time=[Date("2022-10-01"),Date("2022-10-02"),Date("2022-10-05")],Pressure=[7,9,19])

closejoin!(df1,df2,on=[:Time],makeunique=true,border=:missing)

modify!(df1,[:Pressure,:Pressure_1]=>byrow(-)=>:Difference1)

@df df1 plot(:Time,:Difference1,label="Difference1",color=:red)

Upvotes: 6

Related Questions