etaulbee
etaulbee

Reputation: 65

How to join data from 2 datasets conditionally based on time?

In my case, I have 2 datasets I'd like to merge. My first dataset consists of a column of locations, and a column of datetimes.

site              | datetime
---------------------------------------
Cave Mountain     | 2021-08-08 06:28:26
Laurel Run        | 2021-06-24 05:26:07
Laurel Run        | 2021-10-05 05:24:04
Laurel Run        | 2021-10-07 12:26:23
Smoke Cliffs      | 2021-08-30 20:44:39
Smoke Hole Road   | 2021-09-10 19:58:26
University Forest | 2021-09-25 02:15:04
                                                            


data <- structure(list(site = c("Cave Mountain", "Laurel Run", "Laurel Run", 
"Laurel Run", "Smoke Cliffs", "Smoke Hole Road", "University Forest"
), datetime = structure(c(1628418506.5, 1624526767, 1633425844.5, 
1633623983.5, 1630370679, 1631318306, 1632550504), tzone = "America/New_York", class = c("POSIXct", 
"POSIXt"))), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), groups = structure(list(site = c("Cave Mountain", 
"Laurel Run", "Smoke Cliffs", "Smoke Hole Road", "University Forest"
), .rows = structure(list(1L, 2:4, 5L, 6L, 7L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))

My second dataset has a column of locations like the 1st, a column of datetimes, and a column of temperature data.

site              | datetime            | tempf
------------------------------------------------
Cave Mountain     | 2020-06-20 21:20:10 | 79.601
Cave Mountain     | 2020-07-16 21:05:20 | 69.640
Cave Mountain     | 2020-10-07 21:12:00 | 66.706
Laurel Run        | 2020-06-15 21:06:30 | 60.991
Laurel Run        | 2020-06-26 21:23:40 | 64.204
Laurel Run        | 2020-08-11 21:21:10 | 69.685
Smoke Cliffs      | 2020-05-21 21:08:58 | 66.551
Smoke Cliffs      | 2020-05-28 21:00:01 | 59.369
Smoke Hole Road   | 2020-06-13 21:05:10 | 62.919
University Forest | 2020-06-14 21:14:00 | 63.230
University Forest | 2020-08-05 21:09:10 | 61.610
University Forest | 2020-08-08 21:11:00 | 65.007
University Forest | 2020-09-03 21:00:20 | 55.740
University Forest | 2020-09-19 21:16:50 | 67.246

ref <- structure(list(site = c("Cave Mountain", "Cave Mountain", "Cave Mountain", 
"Laurel Run", "Laurel Run", "Laurel Run", "Smoke Cliffs", "Smoke Cliffs", 
"Smoke Hole Road", "University Forest", "University Forest", 
"University Forest", "University Forest", "University Forest"
), datetime = structure(c(1592702410, 1594947920, 1602119520, 
1592269590, 1593221020, 1597195270, 1590109738, 1590714001, 1592096710, 
1592183640, 1596676150, 1596935460, 1599181220, 1600564610), 
tzone = "America/New_York", class = c("POSIXct", 
"POSIXt")), tempf = c(79.601, 69.64, 66.706, 60.991, 64.204, 
69.685, 66.551, 59.369, 62.919, 63.23, 61.61, 65.007, 55.74, 
67.246)), row.names = c(NA, -14L), class = c("tbl_df", "tbl", 
"data.frame"))

My aim is to pull temperature from the 2nd dataset by looking for the closest datetime to each column in the 1st dataset, and then adding the temperature from the 2nd dataset to the associated row in the first dataset. This merge should also be conditional, based on site. The desired output looks like this.

site              | datetime            | tempf
------------------------------------------------
Cave Mountain     | 2021-08-08 06:28:26 | 69.640
Laurel Run        | 2021-06-24 05:26:07 | 64.204
Laurel Run        | 2021-10-05 05:24:04 | 69.685
Laurel Run        | 2021-10-07 12:26:23 | 69.685
Smoke Cliffs      | 2021-08-30 20:44:39 | 59.369
Smoke Hole Road   | 2021-09-10 19:58:26 | 62.919
University Forest | 2021-09-25 02:15:04 | 67.246

I've been trying to achieve this using data.table's rolling join feature with roll = "nearest" , but I cannot figure out how to do achieve it conditionally with the consideration of site.

Upvotes: 3

Views: 62

Answers (1)

Waldi
Waldi

Reputation: 41220

You could specify both site and datetime in the on argument of the rolling join with nearest option:

library(data.table)
setDT(ref)
setDT(data)
ref[data,on=.(site,datetime),roll='nearest']

                site            datetime  tempf
              <char>              <POSc>  <num>
1:     Cave Mountain 2021-08-08 06:28:26 66.706
2:        Laurel Run 2021-06-24 05:26:07 69.685
3:        Laurel Run 2021-10-05 05:24:04 69.685
4:        Laurel Run 2021-10-07 12:26:23 69.685
5:      Smoke Cliffs 2021-08-30 20:44:39 59.369
6:   Smoke Hole Road 2021-09-10 19:58:26 62.919
7: University Forest 2021-09-25 02:15:04 67.246

If needed x. allows to also get access to the nearest reference point datetime, which would for example be useful to remove outdated values:

ref[data,.(site,datetime,nearest_datetime=x.datetime,tempf)
        ,on=.(site,datetime),roll='nearest']

                site            datetime    nearest_datetime  tempf
              <char>              <POSc>              <POSc>  <num>
1:     Cave Mountain 2021-08-08 06:28:26 2020-10-07 21:12:00 66.706
2:        Laurel Run 2021-06-24 05:26:07 2020-08-11 21:21:10 69.685
3:        Laurel Run 2021-10-05 05:24:04 2020-08-11 21:21:10 69.685
4:        Laurel Run 2021-10-07 12:26:23 2020-08-11 21:21:10 69.685
5:      Smoke Cliffs 2021-08-30 20:44:39 2020-05-28 21:00:01 59.369
6:   Smoke Hole Road 2021-09-10 19:58:26 2020-06-13 21:05:10 62.919
7: University Forest 2021-09-25 02:15:04 2020-09-19 21:16:50 67.246

Upvotes: 4

Related Questions