Reputation: 65
In my case, I have 2 datasets I'd like to merge. My first dataset consists of a column of locations, and a column of datetimes.
site | datetime
---------------------------------------
Cave Mountain | 2021-08-08 06:28:26
Laurel Run | 2021-06-24 05:26:07
Laurel Run | 2021-10-05 05:24:04
Laurel Run | 2021-10-07 12:26:23
Smoke Cliffs | 2021-08-30 20:44:39
Smoke Hole Road | 2021-09-10 19:58:26
University Forest | 2021-09-25 02:15:04
data <- structure(list(site = c("Cave Mountain", "Laurel Run", "Laurel Run",
"Laurel Run", "Smoke Cliffs", "Smoke Hole Road", "University Forest"
), datetime = structure(c(1628418506.5, 1624526767, 1633425844.5,
1633623983.5, 1630370679, 1631318306, 1632550504), tzone = "America/New_York", class = c("POSIXct",
"POSIXt"))), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), groups = structure(list(site = c("Cave Mountain",
"Laurel Run", "Smoke Cliffs", "Smoke Hole Road", "University Forest"
), .rows = structure(list(1L, 2:4, 5L, 6L, 7L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))
My second dataset has a column of locations like the 1st, a column of datetimes, and a column of temperature data.
site | datetime | tempf
------------------------------------------------
Cave Mountain | 2020-06-20 21:20:10 | 79.601
Cave Mountain | 2020-07-16 21:05:20 | 69.640
Cave Mountain | 2020-10-07 21:12:00 | 66.706
Laurel Run | 2020-06-15 21:06:30 | 60.991
Laurel Run | 2020-06-26 21:23:40 | 64.204
Laurel Run | 2020-08-11 21:21:10 | 69.685
Smoke Cliffs | 2020-05-21 21:08:58 | 66.551
Smoke Cliffs | 2020-05-28 21:00:01 | 59.369
Smoke Hole Road | 2020-06-13 21:05:10 | 62.919
University Forest | 2020-06-14 21:14:00 | 63.230
University Forest | 2020-08-05 21:09:10 | 61.610
University Forest | 2020-08-08 21:11:00 | 65.007
University Forest | 2020-09-03 21:00:20 | 55.740
University Forest | 2020-09-19 21:16:50 | 67.246
ref <- structure(list(site = c("Cave Mountain", "Cave Mountain", "Cave Mountain",
"Laurel Run", "Laurel Run", "Laurel Run", "Smoke Cliffs", "Smoke Cliffs",
"Smoke Hole Road", "University Forest", "University Forest",
"University Forest", "University Forest", "University Forest"
), datetime = structure(c(1592702410, 1594947920, 1602119520,
1592269590, 1593221020, 1597195270, 1590109738, 1590714001, 1592096710,
1592183640, 1596676150, 1596935460, 1599181220, 1600564610),
tzone = "America/New_York", class = c("POSIXct",
"POSIXt")), tempf = c(79.601, 69.64, 66.706, 60.991, 64.204,
69.685, 66.551, 59.369, 62.919, 63.23, 61.61, 65.007, 55.74,
67.246)), row.names = c(NA, -14L), class = c("tbl_df", "tbl",
"data.frame"))
My aim is to pull temperature from the 2nd dataset by looking for the closest datetime to each column in the 1st dataset, and then adding the temperature from the 2nd dataset to the associated row in the first dataset. This merge should also be conditional, based on site. The desired output looks like this.
site | datetime | tempf
------------------------------------------------
Cave Mountain | 2021-08-08 06:28:26 | 69.640
Laurel Run | 2021-06-24 05:26:07 | 64.204
Laurel Run | 2021-10-05 05:24:04 | 69.685
Laurel Run | 2021-10-07 12:26:23 | 69.685
Smoke Cliffs | 2021-08-30 20:44:39 | 59.369
Smoke Hole Road | 2021-09-10 19:58:26 | 62.919
University Forest | 2021-09-25 02:15:04 | 67.246
I've been trying to achieve this using data.table
's rolling join feature with roll = "nearest" , but I cannot figure out how to do achieve it conditionally with the consideration of site.
Upvotes: 3
Views: 62
Reputation: 41220
You could specify both site
and datetime
in the on
argument of the rolling join with nearest
option:
library(data.table)
setDT(ref)
setDT(data)
ref[data,on=.(site,datetime),roll='nearest']
site datetime tempf
<char> <POSc> <num>
1: Cave Mountain 2021-08-08 06:28:26 66.706
2: Laurel Run 2021-06-24 05:26:07 69.685
3: Laurel Run 2021-10-05 05:24:04 69.685
4: Laurel Run 2021-10-07 12:26:23 69.685
5: Smoke Cliffs 2021-08-30 20:44:39 59.369
6: Smoke Hole Road 2021-09-10 19:58:26 62.919
7: University Forest 2021-09-25 02:15:04 67.246
If needed x.
allows to also get access to the nearest reference point datetime
, which would for example be useful to remove outdated values:
ref[data,.(site,datetime,nearest_datetime=x.datetime,tempf)
,on=.(site,datetime),roll='nearest']
site datetime nearest_datetime tempf
<char> <POSc> <POSc> <num>
1: Cave Mountain 2021-08-08 06:28:26 2020-10-07 21:12:00 66.706
2: Laurel Run 2021-06-24 05:26:07 2020-08-11 21:21:10 69.685
3: Laurel Run 2021-10-05 05:24:04 2020-08-11 21:21:10 69.685
4: Laurel Run 2021-10-07 12:26:23 2020-08-11 21:21:10 69.685
5: Smoke Cliffs 2021-08-30 20:44:39 2020-05-28 21:00:01 59.369
6: Smoke Hole Road 2021-09-10 19:58:26 2020-06-13 21:05:10 62.919
7: University Forest 2021-09-25 02:15:04 2020-09-19 21:16:50 67.246
Upvotes: 4