Reputation: 741
I need to join two dataset with the closet timestamp. The first datset is diary dataset from an mobile application:
df1 <- data.frame(stringsAsFactors=FALSE,
datetime = c("2019-03-19T13:26:52Z", "2019-03-19T13:26:19Z",
"2019-03-19T13:23:46Z", "2019-03-19T13:22:20Z",
"2019-03-19T13:09:56Z", "2019-03-19T13:06:04Z", "2019-03-19T13:05:21Z",
"2019-03-19T13:04:37Z", "2019-03-19T12:47:28Z",
"2019-03-19T12:46:42Z"),
transport = c("Trainride", "Trainride", "Trainride", "Trainride",
"Trainride", "Trainride", "Trainride", "Trainride",
"Trainride", "Trainride"),
id = c("5-3", "5-3", "5-3", "5-3", "5-3", "5-3", "5-3", "5-3", "5-3",
"5-3"),
disc = c("start", "stop", "start", "stop", "start", "stop", "start",
"stop", "start", "stop")
)
# datetime dttr object
df1 <- df1 %>%
mutate(datetime = lubridate::as_datetime(datetime))
here:
datetime transport id disc
1 2019-03-19 13:26:52 Trainride 5-3 start
2 2019-03-19 13:26:19 Trainride 5-3 stop
3 2019-03-19 13:23:46 Trainride 5-3 start
4 2019-03-19 13:22:20 Trainride 5-3 stop
5 2019-03-19 13:09:56 Trainride 5-3 start
6 2019-03-19 13:06:04 Trainride 5-3 stop
7 2019-03-19 13:05:21 Trainride 5-3 start
8 2019-03-19 13:04:37 Trainride 5-3 stop
9 2019-03-19 12:47:28 Trainride 5-3 start
10 2019-03-19 12:46:42 Trainride 5-3 stop
and the second dataset is dataset from accelerometer log, showing movement(=INVH) or idle(=NIVH) :
df2 <- data.frame(stringsAsFactors=FALSE,
datetime = c("2019-03-19T23:20:00Z", "2019-03-19T23:17:30Z",
"2019-03-19T13:08:00Z", "2019-03-19T13:07:00Z",
"2019-03-19T12:38:45Z", "2019-03-19T12:32:45Z",
"2019-03-19T11:13:15Z", "2019-03-19T11:11:45Z", "2019-03-19T10:17:45Z",
"2019-03-19T10:16:45Z"),
id = c("5-3", "5-3", "5-3", "5-3", "5-3", "5-3", "5-3", "5-3", "5-3",
"5-3"),
code = c("NIVH", "INVH", "NIVH", "INVH", "NIVH", "INVH", "NIVH",
"INVH", "NIVH", "INVH")
)
# datetime dttr object
df2 <- df2 %>%
mutate(datetime = lubridate::as_datetime(datetime))
here:
datetime id code
1 2019-03-19 23:20:00 5-3 NIVH
2 2019-03-19 23:17:30 5-3 INVH
3 2019-03-19 13:08:00 5-3 NIVH
4 2019-03-19 13:07:00 5-3 INVH
5 2019-03-19 12:38:45 5-3 NIVH
6 2019-03-19 12:32:45 5-3 INVH
7 2019-03-19 11:13:15 5-3 NIVH
8 2019-03-19 11:11:45 5-3 INVH
9 2019-03-19 10:17:45 5-3 NIVH
10 2019-03-19 10:16:45 5-3 INVH
I need to join two data frames based on time difference between timestamp fields. for example left join on df1 to see how app diary data is an agreement with real accelerometer log. simple left join will not work here becuase in most of the case there is a lag time. So my questen is how can i join these two dataset based on the record where the difference is minimum absolute difference.
EDIT 2 --- The solution suggested by @soren is helpful, however If I take start datetime and endtime to have a interval , will not work. any thought?
df1 <- df1 %>%
mutate(datetime = lubridate::as_datetime(datetime)) %>%
arrange(datetime) %>%
mutate(datetime_end = lead(datetime),
# Create an interval object.
Travel_Interval = lubridate::interval(start = datetime, end = datetime_end))
Upvotes: 3
Views: 326
Reputation: 2425
A rolling join will accomplish this, joining the data frame by nearest datetime. The following solution using data.table
library(data.table)
dt1 <- as.data.table(df1)
setkeyv(dt1,"datetime")
dt2 <- as.data.table(df2)
setkeyv(dt2,"datetime")
dt2[,nearest_date:=datetime]
dt2[dt1,roll="nearest"]
> dt2[dt1,roll="nearest"]
datetime id code nearest_date transport i.id disc
1: 2019-03-19 12:46:42 5-3 NIVH 2019-03-19 12:38:45 Trainride 5-3 stop
2: 2019-03-19 12:47:28 5-3 NIVH 2019-03-19 12:38:45 Trainride 5-3 start
3: 2019-03-19 13:04:37 5-3 INVH 2019-03-19 13:07:00 Trainride 5-3 stop
4: 2019-03-19 13:05:21 5-3 INVH 2019-03-19 13:07:00 Trainride 5-3 start
5: 2019-03-19 13:06:04 5-3 INVH 2019-03-19 13:07:00 Trainride 5-3 stop
6: 2019-03-19 13:09:56 5-3 NIVH 2019-03-19 13:08:00 Trainride 5-3 start
7: 2019-03-19 13:22:20 5-3 NIVH 2019-03-19 13:08:00 Trainride 5-3 stop
8: 2019-03-19 13:23:46 5-3 NIVH 2019-03-19 13:08:00 Trainride 5-3 start
9: 2019-03-19 13:26:19 5-3 NIVH 2019-03-19 13:08:00 Trainride 5-3 stop
10: 2019-03-19 13:26:52 5-3 NIVH 2019-03-19 13:08:00 Trainride 5-3 start
Note that the additional column "nearest_date" is added dt2[,nearest_date:=datetime]
so that when dt2 and dt1 are joined, the value of the datetime is retained (or available to use for difference calculations). Whereas merging/joining datasets together by default return only the primary keyed column. Normally, this is desirable since joined columns are frequently identical. Here, with a rolling join they are not and it's useful to retain the column, or to manually create a redundant one, as is exemplified here.
Note also that nearest id
value and datetime
may also be relevant for the application (only one id value given in OP). This is accomplished by setting the keys, which are used in the join syntax:
setkeyv(dt2,c("id","datetime"))
setkeyv(dt1,c("id","datetime"))
Upvotes: 9