Vangelis Tasoulas
Vangelis Tasoulas

Reputation: 3189

Subtract data frames with different number of rows?

I have two data frames with two columns each. The first column is timestamps and the second contains some values. One of the data frames is much bigger than the other one but both of them contains data in the same timestamp range.

If I plot these two on top of each other, I will get a nice plot showing how they differ in time.

Now I would like to get the absolute difference by time of these two dataframes to make a another plot showing how much they differ (or to create a boxplot with this information) even though they do not have the same length and exact matching timestamps.

Check this example:

df1:

timestamp | data
1334103075| 1.2
1334103085| 1.5
1334103095| 0.9
1334103105| 0.7
1334103115| 1.1
1334103125| 0.8

df2:

timestamp | data
1334103078| 1.2
1334103099| 1.5
1334103123| 0.8
1334103125| 0.9

How would I achieve something like this:

df3 <- abs(df1-df2)

As you see df2 might not have the same timestamps as df1, but they both have timestamps in the same time range.

Of course the subtraction should try to match timestamps or subtract values from timestamp averages that they are near.

Upvotes: 4

Views: 1673

Answers (1)

flodel
flodel

Reputation: 89057

I would suggest using two linear interpolators and evaluate both of them on the union of your two sets of timestamps:

df1 <- data.frame(timestamp = c(1334103075, 1334103085, 1334103095,
                                1334103105, 1334103115, 1334103125),
                  data      = c(1.2, 1.5, 0.9, 0.7, 1.1, 0.8))

df2 <- data.frame(timestamp = c(1334103078, 1334103099, 1334103123,
                                1334103125),
                  data      = c(1.2, 1.5, 0.8, 0.9))

library(Hmisc)
all.timestamps <- sort(unique(c(df1$timestamp, df2$timestamp)))
data1 <- approxExtrap(df1$timestamp, df1$data, all.timestamps)$y
data2 <- approxExtrap(df2$timestamp, df2$data, all.timestamps)$y

df3 <- data.frame(timestamp = all.timestamps,
                  data1     = data1,
                  data2     = data2,
                  abs.diff  = abs(data1 - data2))
df3
#    timestamp data1    data2   abs.diff
# 1 1334103075  1.20 1.157143 0.04285714
# 2 1334103078  1.29 1.200000 0.09000000
# 3 1334103085  1.50 1.300000 0.20000000
# 4 1334103095  0.90 1.442857 0.54285714
# 5 1334103099  0.82 1.500000 0.68000000
# 6 1334103105  0.70 1.325000 0.62500000
# 7 1334103115  1.10 1.033333 0.06666667
# 8 1334103123  0.86 0.800000 0.06000000
# 9 1334103125  0.80 0.900000 0.10000000

Then you could consider fitting splines if you are not quite happy with linear approximations.

Upvotes: 3

Related Questions