John Thomas
John Thomas

Reputation: 1105

Matching timestamps in two dataframes using dplyr

So I have a dataframe like this

DF1

ID    timestamp              value
1     2018-07-31 20:00:00    45555
1     2018-08-01 06:50:00    333
1     2018-08-01 07:00:00    322
2     2018-08-01 13:00:00    1222
2     2018-08-01 22:00:00    1111
3     2018-08-01 11:00:00    0
3     2018-08-02 08:00:00    22222

And another one like this

DF2

          startTime             endTime   ID    
2018-08-01 06:45:00 2018-08-01 09:49:00   1
2019-12-02 11:01:00 2019-12-02 11:02:00   2
2018-08-02 07:59:00 2018-08-02 08:50:00   2
2018-08-02 07:59:00 2018-08-02 08:50:00   3

So I want to create a new column using mutate which mutates DF2 that tells me if any time in between the startTime and endTime appears in DF1 that matches the ID.

So in DF2 if we have a date like 8/2/2018 and the time span is 7:59AM - 8:50AM , if any timestamp within this interval inclusive (2018-08-02 7:59:00, 2018-08-02 8:00:00, 2018-08-02 8:01:00...) appears in DF1 then a variable in DF2 reads 1, else 0. This is only the case if the ID matches. Does not matter how many times, atleast once is what matters.

Final version of DF2 should look like.

          startTime             endTime   ID   match
2018-08-01 06:45:00 2018-08-01 09:49:00   1    1
2019-12-02 11:01:00 2019-12-02 11:02:00   2    0
2018-08-02 07:59:00 2018-08-02 08:50:00   2    0
2018-08-02 07:59:00 2018-08-02 08:50:00   3    1

Ideally I would like to use BETWEEN() and remain in the tidyverse.

Upvotes: 3

Views: 408

Answers (2)

RyanFrost
RyanFrost

Reputation: 1428

Per your request for a tidyverse solution, you can also do this by first joining the two dfs by 'ID', then using lubridate's %within% and interval (%--%) functions to find matches:

library(dplyr)
library(lubridate)

df_both <- left_join(DF2, DF1, by = "ID") 

df_both %>%
  group_by(ID, startTime, endTime) %>%
  summarize(match = any(timestamp %within% (startTime %--% endTime)),
            .groups = "drop")
#> # A tibble: 4 x 4
#>      ID startTime           endTime             match
#>   <int> <dttm>              <dttm>              <lgl>
#> 1     1 2018-08-01 06:45:00 2018-08-01 09:49:00 TRUE 
#> 2     2 2018-08-02 07:59:00 2018-08-02 08:50:00 FALSE
#> 3     2 2019-12-02 12:01:00 2019-12-02 12:02:00 FALSE
#> 4     3 2018-08-02 07:59:00 2018-08-02 08:50:00 TRUE

If you prefer 1/0 instead of TRUE/FALSE, you can use as.integer to convert the 'match' column as such.

Upvotes: 1

akrun
akrun

Reputation: 886948

We could use a data.table non-equi join on the DateTime columns along with 'ID' column, create the match by checking if the number of rows (.N) is greater than 0

library(data.table)
DF2$match <- 0L
setDT(DF2)[DF1,  match := +(.N > 0),on = .(ID, 
        startTime < timestamp, endTime > timestamp), by = .EACHI]
DF2
#             startTime             endTime ID match
#1: 2018-08-01 06:45:00 2018-08-01 09:49:00  1     1
#2: 2019-12-02 11:01:00 2019-12-02 11:02:00  2     0
#3: 2018-08-02 07:59:00 2018-08-02 08:50:00  2     0
#4: 2018-08-02 07:59:00 2018-08-02 08:50:00  3     1

data

DF1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
 timestamp = structure(c(1533085200, 
1533124200, 1533124800, 1533146400, 1533178800, 1533139200, 1533214800
), class = c("POSIXct", "POSIXt"), tzone = ""), value = c(45555L, 
333L, 322L, 1222L, 1111L, 0L, 22222L)), row.names = c(NA, -7L
), class = "data.frame")

DF2 <- structure(list(startTime = structure(c(1533123900, 1575306060, 
1533214740, 1533214740), class = c("POSIXct", "POSIXt"), tzone = ""), 
    endTime = structure(c(1533134940, 1575306120, 1533217800, 
    1533217800), class = c("POSIXct", "POSIXt"), tzone = ""), 
    ID = c(1L, 2L, 2L, 3L)), row.names = c(NA, -4L), class = "data.frame")

Upvotes: 3

Related Questions