Reputation: 21432
I have data from different measurements of the same temporal event stored in two dataframes. The timestamps in the two dataframes are not 100% the same though. I'd like to merge the two dataframes by synchronizing them as well as possible.
This is df1
(reproducible data below):
df1
rec_time gsr
1 00:00:01.902 2.513
2 00:00:01.905 2.513
3 00:00:01.907 2.518
4 00:00:01.909 2.524
5 00:00:01.913 2.532
6 00:00:01.916 2.499
7 00:00:01.918 2.513
8 00:00:01.922 2.500
9 00:00:01.927 2.513
10 00:00:01.930 2.513
11 00:00:01.989 2.513
12 00:00:01.994 2.513
13 00:00:01.996 2.513
14 00:00:01.999 2.513
15 00:00:02.004 2.513
This is df2
(reproducible data below):
df2
Start_Time AOI
1 00:00:01.905 Iso
2 00:00:01.914 Kat
3 00:00:01.925 Iso
4 00:00:01.930 Iso
5 00:00:01.991 Kat
As can be seen, sometimes the timestamps (rec_time
in df1
and Start_Time
in df2
) show the same time; more often though there are differences. Now for the merge, where there are differences, I'd like to put the rows from df2
on that row in df1
where the time difference is smallest. How can this synchronization be achieved?
The expected result is this:
df3
rec_time gsr Start_Time AOI
1 00:00:01.902 2.513 <NA> <NA>
2 00:00:01.905 2.513 00:00:01.905 Iso # same time
3 00:00:01.907 2.518 <NA> <NA>
4 00:00:01.909 2.524 <NA> <NA>
5 00:00:01.913 2.532 00:00:01.914 Kat # different time
6 00:00:01.916 2.499 <NA> <NA>
7 00:00:01.918 2.513 <NA> <NA>
8 00:00:01.922 2.500 <NA> <NA>
9 00:00:01.927 2.513 00:00:01.925 Iso # different time
10 00:00:01.930 2.513 00:00:01.930 Iso # same time
11 00:00:01.989 2.513 <NA> <NA>
12 00:00:01.994 2.513 00:00:01.991 Kat # different time
13 00:00:01.996 2.513 <NA> <NA>
14 00:00:01.999 2.513 <NA> <NA>
15 00:00:02.004 2.513 <NA> <NA>
What I've tried so far is meager:
merge(df1, df2, by.x = "rec_time", by.y = "Start_Time", all.x = T)
This obviously only catches those rows that have identical timestamps.
Reproducible data:
df1 <- dput(gsr)
structure(list(rec_time = structure(1:15, .Label = c("00:00:01.902",
"00:00:01.905", "00:00:01.907", "00:00:01.909", "00:00:01.913",
"00:00:01.916", "00:00:01.918", "00:00:01.922", "00:00:01.927",
"00:00:01.930", "00:00:01.989", "00:00:01.994", "00:00:01.996",
"00:00:01.999", "00:00:02.004"), class = "factor"), gsr = c(2.513,
2.513, 2.518, 2.524, 2.532, 2.499, 2.513, 2.5, 2.513, 2.513,
2.513, 2.513, 2.513, 2.513, 2.513)), class = "data.frame", row.names = c(NA,
-15L))
df2 <- dput(et)
structure(list(Start_Time = structure(1:5, .Label = c("00:00:01.905",
"00:00:01.914", "00:00:01.925", "00:00:01.930", "00:00:01.991"
), class = "factor"), AOI = structure(c(1L, 2L, 1L, 1L, 2L), .Label = c("Iso",
"Kat"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
Upvotes: 1
Views: 534
Reputation: 8880
using @Allan Cameron suggestion one can try the library(fuzzyjoin)
varying max_dist
, you can get different degrees of accuracy
difference_left_join(x = df1,
y = df2,
by = c("ms_time"),
max_dist = 2)
rec_time gsr ms_time.x Start_Time AOI ms_time.y
1 00:00:01.902 2.513 1902 <NA> <NA> NA
2 00:00:01.905 2.513 1905 00:00:01.905 Iso 1905
3 00:00:01.907 2.518 1907 00:00:01.905 Iso 1905
4 00:00:01.909 2.524 1909 <NA> <NA> NA
5 00:00:01.913 2.532 1913 00:00:01.914 Kat 1914
6 00:00:01.916 2.499 1916 00:00:01.914 Kat 1914
7 00:00:01.918 2.513 1918 <NA> <NA> NA
8 00:00:01.922 2.500 1922 <NA> <NA> NA
9 00:00:01.927 2.513 1927 00:00:01.925 Iso 1925
10 00:00:01.930 2.513 1930 00:00:01.930 Iso 1930
11 00:00:01.989 2.513 1989 00:00:01.991 Kat 1991
12 00:00:01.994 2.513 1994 <NA> <NA> NA
13 00:00:01.996 2.513 1996 <NA> <NA> NA
14 00:00:01.999 2.513 1999 <NA> <NA> NA
15 00:00:02.004 2.513 2004 <NA> <NA> NA
Upvotes: 3
Reputation: 174328
I think it would be easier to work with the data if we first converted the timestamps to milliseconds:
library(dplyr)
timestamp_to_ms <- function(t) {
sapply(strsplit(as.character(t), ":"), function(x) {
1000 * sum(c(3600, 60, 1) * as.numeric(x))
})
}
df1$ms_time <- timestamp_to_ms(df1$rec_time)
df2$ms_time <- timestamp_to_ms(df2$Start_Time)
Now that we have the time in milliseconds in both data frames, it is easy to find the nearest matching timestamps and record them in df2
:
df2$ms_time <- sapply(df2$ms_time, function(x) {
df1$ms_time[which.min(abs(x - df1$ms_time))]
})
This then allows a standard left_join
:
df3 <- left_join(df1, df2, by = "ms_time") %>%
select(-ms_time)
df3
#> rec_time gsr Start_Time AOI
#> 1 00:00:01.902 2.513 <NA> <NA>
#> 2 00:00:01.905 2.513 00:00:01.905 Iso
#> 3 00:00:01.907 2.518 <NA> <NA>
#> 4 00:00:01.909 2.524 <NA> <NA>
#> 5 00:00:01.913 2.532 00:00:01.914 Kat
#> 6 00:00:01.916 2.499 <NA> <NA>
#> 7 00:00:01.918 2.513 <NA> <NA>
#> 8 00:00:01.922 2.500 <NA> <NA>
#> 9 00:00:01.927 2.513 00:00:01.925 Iso
#> 10 00:00:01.930 2.513 00:00:01.930 Iso
#> 11 00:00:01.989 2.513 00:00:01.991 Kat
#> 12 00:00:01.994 2.513 <NA> <NA>
#> 13 00:00:01.996 2.513 <NA> <NA>
#> 14 00:00:01.999 2.513 <NA> <NA>
#> 15 00:00:02.004 2.513 <NA> <NA>
Created on 2020-09-17 by the reprex package (v0.3.0)
Upvotes: 3