Chris Ruehlemann
Chris Ruehlemann

Reputation: 21432

How to synchronize data with partly different timestamps in R

I have data from different measurements of the same temporal event stored in two dataframes. The timestamps in the two dataframes are not 100% the same though. I'd like to merge the two dataframes by synchronizing them as well as possible.

This is df1 (reproducible data below):

df1
       rec_time   gsr
1  00:00:01.902 2.513
2  00:00:01.905 2.513
3  00:00:01.907 2.518
4  00:00:01.909 2.524
5  00:00:01.913 2.532
6  00:00:01.916 2.499
7  00:00:01.918 2.513
8  00:00:01.922 2.500
9  00:00:01.927 2.513
10 00:00:01.930 2.513
11 00:00:01.989 2.513
12 00:00:01.994 2.513
13 00:00:01.996 2.513
14 00:00:01.999 2.513
15 00:00:02.004 2.513

This is df2 (reproducible data below):

df2
    Start_Time AOI
1 00:00:01.905 Iso
2 00:00:01.914 Kat
3 00:00:01.925 Iso
4 00:00:01.930 Iso
5 00:00:01.991 Kat

As can be seen, sometimes the timestamps (rec_time in df1 and Start_Timein df2) show the same time; more often though there are differences. Now for the merge, where there are differences, I'd like to put the rows from df2 on that row in df1 where the time difference is smallest. How can this synchronization be achieved?

The expected result is this:

df3
       rec_time   gsr   Start_Time  AOI
1  00:00:01.902 2.513         <NA> <NA>
2  00:00:01.905 2.513 00:00:01.905  Iso   # same time
3  00:00:01.907 2.518         <NA> <NA>
4  00:00:01.909 2.524         <NA> <NA>
5  00:00:01.913 2.532 00:00:01.914  Kat   # different time
6  00:00:01.916 2.499         <NA> <NA>
7  00:00:01.918 2.513         <NA> <NA>
8  00:00:01.922 2.500         <NA> <NA>
9  00:00:01.927 2.513 00:00:01.925  Iso   # different time
10 00:00:01.930 2.513 00:00:01.930  Iso   # same time
11 00:00:01.989 2.513         <NA> <NA>
12 00:00:01.994 2.513 00:00:01.991  Kat   # different time
13 00:00:01.996 2.513         <NA> <NA>
14 00:00:01.999 2.513         <NA> <NA>
15 00:00:02.004 2.513         <NA> <NA>

What I've tried so far is meager:

merge(df1, df2, by.x = "rec_time", by.y = "Start_Time", all.x = T)

This obviously only catches those rows that have identical timestamps.

Reproducible data:

df1 <- dput(gsr)
structure(list(rec_time = structure(1:15, .Label = c("00:00:01.902", 
"00:00:01.905", "00:00:01.907", "00:00:01.909", "00:00:01.913", 
"00:00:01.916", "00:00:01.918", "00:00:01.922", "00:00:01.927", 
"00:00:01.930", "00:00:01.989", "00:00:01.994", "00:00:01.996", 
"00:00:01.999", "00:00:02.004"), class = "factor"), gsr = c(2.513, 
2.513, 2.518, 2.524, 2.532, 2.499, 2.513, 2.5, 2.513, 2.513, 
2.513, 2.513, 2.513, 2.513, 2.513)), class = "data.frame", row.names = c(NA, 
-15L))

df2 <- dput(et)
structure(list(Start_Time = structure(1:5, .Label = c("00:00:01.905", 
"00:00:01.914", "00:00:01.925", "00:00:01.930", "00:00:01.991"
), class = "factor"), AOI = structure(c(1L, 2L, 1L, 1L, 2L), .Label = c("Iso", 
"Kat"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

Upvotes: 1

Views: 534

Answers (2)

Yuriy Saraykin
Yuriy Saraykin

Reputation: 8880

using @Allan Cameron suggestion one can try the library(fuzzyjoin)

varying max_dist, you can get different degrees of accuracy

difference_left_join(x = df1,
                     y = df2,
                     by = c("ms_time"),
                     max_dist = 2)

       rec_time   gsr ms_time.x   Start_Time  AOI ms_time.y
1  00:00:01.902 2.513      1902         <NA> <NA>        NA
2  00:00:01.905 2.513      1905 00:00:01.905  Iso      1905
3  00:00:01.907 2.518      1907 00:00:01.905  Iso      1905
4  00:00:01.909 2.524      1909         <NA> <NA>        NA
5  00:00:01.913 2.532      1913 00:00:01.914  Kat      1914
6  00:00:01.916 2.499      1916 00:00:01.914  Kat      1914
7  00:00:01.918 2.513      1918         <NA> <NA>        NA
8  00:00:01.922 2.500      1922         <NA> <NA>        NA
9  00:00:01.927 2.513      1927 00:00:01.925  Iso      1925
10 00:00:01.930 2.513      1930 00:00:01.930  Iso      1930
11 00:00:01.989 2.513      1989 00:00:01.991  Kat      1991
12 00:00:01.994 2.513      1994         <NA> <NA>        NA
13 00:00:01.996 2.513      1996         <NA> <NA>        NA
14 00:00:01.999 2.513      1999         <NA> <NA>        NA
15 00:00:02.004 2.513      2004         <NA> <NA>        NA

Upvotes: 3

Allan Cameron
Allan Cameron

Reputation: 174328

I think it would be easier to work with the data if we first converted the timestamps to milliseconds:

library(dplyr)

timestamp_to_ms <- function(t) {
  sapply(strsplit(as.character(t), ":"), function(x) {
    1000 * sum(c(3600, 60, 1) * as.numeric(x))
  })
}

df1$ms_time <- timestamp_to_ms(df1$rec_time)
df2$ms_time <- timestamp_to_ms(df2$Start_Time)

Now that we have the time in milliseconds in both data frames, it is easy to find the nearest matching timestamps and record them in df2:

df2$ms_time <- sapply(df2$ms_time, function(x) {
  df1$ms_time[which.min(abs(x - df1$ms_time))]
})

This then allows a standard left_join:

df3 <- left_join(df1, df2, by = "ms_time") %>% 
         select(-ms_time)

df3
#>        rec_time   gsr   Start_Time  AOI
#> 1  00:00:01.902 2.513         <NA> <NA>
#> 2  00:00:01.905 2.513 00:00:01.905  Iso
#> 3  00:00:01.907 2.518         <NA> <NA>
#> 4  00:00:01.909 2.524         <NA> <NA>
#> 5  00:00:01.913 2.532 00:00:01.914  Kat
#> 6  00:00:01.916 2.499         <NA> <NA>
#> 7  00:00:01.918 2.513         <NA> <NA>
#> 8  00:00:01.922 2.500         <NA> <NA>
#> 9  00:00:01.927 2.513 00:00:01.925  Iso
#> 10 00:00:01.930 2.513 00:00:01.930  Iso
#> 11 00:00:01.989 2.513 00:00:01.991  Kat
#> 12 00:00:01.994 2.513         <NA> <NA>
#> 13 00:00:01.996 2.513         <NA> <NA>
#> 14 00:00:01.999 2.513         <NA> <NA>
#> 15 00:00:02.004 2.513         <NA> <NA>

Created on 2020-09-17 by the reprex package (v0.3.0)

Upvotes: 3

Related Questions