Reputation: 741
I have two data frames like the dummy samples , df1 (main data set) and df2. These data originally coming from two different log data.
df1 is always complete dataset from 02:00 midnight until 02:00 next day with different stream values during this period. But df2 only has value if there is an activity in the specific time period (between 02:00 and next day 02:00).
I was wondering how can I combine df1 and df2. My goal is to have similar complete dataset like df1 ( from 02:00 to next day 02:00) with additional variable - code - this variable can be empty if there is no activity/code in that start / end time in df2. And sometime the row in df1 needs to broken into 2 if there is an activity only in some part of that start / end time period
I am looking for the result similar to this. Showing only 4 rows just for example
id STREAM_1 STREAM_2 STREAM_3 START END
401 NIVH-ON IN null 2022-08-16 7:43:30 2022-08-16 7:45:00
401 INVH-ON OUT 12be4 2022-08-16 7:45:00 2022-08-16 7:49:00
401 INVH-ON OUT 12be4 2022-08-16 7:49:00 2022-08-16 8:15:00
df1:
df1 <- tibble::tribble(
~id, ~STREAM_1, ~STREAM_2, ~START, ~END,
401L, "NIVH-OFF", "IN", "2022-08-16 2:00:00", "2022-08-16 3:35:00",
401L, "NIVH-OFF", "IN", "2022-08-16 3:35:00", "2022-08-16 7:22:45",
401L, "NIVH-ON", "IN", "2022-08-16 7:22:45", "2022-08-16 7:31:45",
401L, "NIVH-DOCK", "IN", "2022-08-16 7:31:45", "2022-08-16 7:43:30",
401L, "NIVH-ON", "IN", "2022-08-16 7:43:30", "2022-08-16 7:45:00",
401L, "INVH-ON", "IN", "2022-08-16 7:45:00", "2022-08-16 7:49:00",
401L, "INVH-ON", "OUT", "2022-08-16 7:49:00", "2022-08-16 8:50:00",
401L, "INVH-ON", "IN", "2022-08-16 8:50:00", "2022-08-16 8:56:00",
401L, "NIVH-ON", "IN", "2022-08-16 8:56:00", "2022-08-16 8:58:00",
401L, "INVH-ON", "IN", "2022-08-16 8:58:00", "2022-08-16 9:00:00",
401L, "NIVH-ON", "IN", "2022-08-16 9:00:00", "2022-08-16 9:24:00",
401L, "INVH-ON", "IN", "2022-08-16 9:24:00", "2022-08-16 9:25:00",
401L, "NIVH-ON", "IN", "2022-08-16 9:25:00", "2022-08-16 9:55:00",
401L, "NIVH-ON", "OUT", "2022-08-16 9:55:00", "2022-08-16 10:20:00",
401L, "NIVH-ON", "IN", "2022-08-16 10:20:00", "2022-08-16 11:30:00",
401L, "NIVH-ON", "OUT", "2022-08-16 11:30:00", "2022-08-16 12:00:00",
401L, "NIVH-ON", "IN", "2022-08-16 12:00:00", "2022-08-16 12:03:00",
401L, "INVH-ON", "IN", "2022-08-16 12:03:00", "2022-08-16 12:06:00",
401L, "NIVH-ON", "IN", "2022-08-16 12:06:00", "2022-08-16 12:08:00",
401L, "INVH-ON", "IN", "2022-08-16 12:08:00", "2022-08-16 12:11:00",
401L, "NIVH-ON", "IN", "2022-08-16 12:11:00", "2022-08-16 12:33:00",
401L, "INVH-ON", "IN", "2022-08-16 12:33:00", "2022-08-16 12:34:00",
401L, "NIVH-ON", "IN", "2022-08-16 12:34:00", "2022-08-16 13:25:00",
401L, "NIVH-ON", "IN", "2022-08-16 13:25:00", "2022-08-16 13:35:00",
401L, "NIVH-ON", "IN", "2022-08-16 13:35:00", "2022-08-16 14:14:00",
401L, "NIVH-ON", "IN", "2022-08-16 14:14:00", "2022-08-16 14:15:00",
401L, "INVH-ON", "IN", "2022-08-16 14:15:00", "2022-08-16 14:19:00",
401L, "NIVH-ON", "IN", "2022-08-16 14:19:00", "2022-08-16 15:04:00",
401L, "NIVH-ON", "IN", "2022-08-16 15:04:00", "2022-08-16 15:09:00",
401L, "NIVH-ON", "IN", "2022-08-16 15:09:00", "2022-08-16 15:24:00",
401L, "NIVH-ON", "OUT", "2022-08-16 15:24:00", "2022-08-16 15:26:00",
401L, "INVH-ON", "OUT", "2022-08-16 15:26:00", "2022-08-16 16:04:00",
401L, "INVH-ON", "IN", "2022-08-16 16:04:00", "2022-08-16 16:07:00",
401L, "NIVH-ON", "IN", "2022-08-16 16:07:00", "2022-08-16 16:13:00",
401L, "INVH-ON", "IN", "2022-08-16 16:13:00", "2022-08-16 16:15:00",
401L, "NIVH-ON", "IN", "2022-08-16 16:15:00", "2022-08-16 16:20:15",
401L, "NIVH-DOCK", "IN", "2022-08-16 16:20:15", "2022-08-17 2:00:00"
)
df2:
df2 <- tibble::tribble(
~id, ~code, ~date, ~start_time, ~endtime,
401L, "12be4", 20220816L, "2022-08-16 07:45:15", "2022-08-16 08:15:15",
401L, "12be4", 20220816L, "2022-08-16 08:15:45", "2022-08-16 08:21:15",
401L, "12be4", 20220816L, "2022-08-16 08:21:30", "2022-08-16 08:40:15",
401L, "11e44", 20220816L, "2022-08-16 14:49:30", "2022-08-16 15:17:30",
401L, "12be4", 20220816L, "2022-08-16 15:24:30", "2022-08-16 15:33:15",
401L, "1fb84", 20220816L, "2022-08-16 15:33:15", "2022-08-16 15:43:45",
401L, "1fb84", 20220816L, "2022-08-16 15:46:30", "2022-08-16 15:50:30",
401L, "1b1c4", 20220816L, "2022-08-16 15:50:30", "2022-08-16 16:02:45"
)
Upvotes: 7
Views: 806
Reputation: 41220
You could use foverlaps
from data.table
package
library(data.table)
# convert to data.table
setDT(df1)
setDT(df2)
# Character dates to numeric dates
df1[,START:=as.POSIXct(START)]
df1[,END:=as.POSIXct(END)]
df2[,start_time:=as.POSIXct(start_time)]
df2[,endtime:=as.POSIXct(endtime)]
# key needed for second data.table
setkey(df2,id,start_time,endtime)
# Overlap join
foverlaps(df1,df2,by.x = c("id","START","END"))[
,.(id,STREAM_1,STREAM_2,STREAM_3=code,START,END)][
,.(STREAM_3=first(STREAM_3)),by=.(id,STREAM_1,STREAM_2,START,END)]
id STREAM_1 STREAM_2 START END STREAM_3
<int> <char> <char> <POSc> <POSc> <char>
1: 401 NIVH-OFF IN 2022-08-16 02:00:00 2022-08-16 03:35:00 <NA>
2: 401 NIVH-OFF IN 2022-08-16 03:35:00 2022-08-16 07:22:45 <NA>
3: 401 NIVH-ON IN 2022-08-16 07:22:45 2022-08-16 07:31:45 <NA>
4: 401 NIVH-DOCK IN 2022-08-16 07:31:45 2022-08-16 07:43:30 <NA>
5: 401 NIVH-ON IN 2022-08-16 07:43:30 2022-08-16 07:45:00 <NA>
6: 401 INVH-ON IN 2022-08-16 07:45:00 2022-08-16 07:49:00 12be4
7: 401 INVH-ON OUT 2022-08-16 07:49:00 2022-08-16 08:50:00 12be4
8: 401 INVH-ON IN 2022-08-16 08:50:00 2022-08-16 08:56:00 <NA>
9: 401 NIVH-ON IN 2022-08-16 08:56:00 2022-08-16 08:58:00 <NA>
10: 401 INVH-ON IN 2022-08-16 08:58:00 2022-08-16 09:00:00 <NA>
...
Upvotes: 5
Reputation: 66415
Adapting @Quinten's answer for the dev version of dplyr:
#devtools::install_github("tidyverse/dplyr") # v1.0.99.9000 @ 2022-08-25
library(dplyr)
left_join(df1 %>% mutate(across(START:END, lubridate::ymd_hms)),
df2 %>% mutate(across(start_time:endtime, lubridate::ymd_hms)),
join_by(id == id, START < start_time, END > endtime))
Upvotes: 0
Reputation: 41225
Maybe you want to use a fuzzy_left_join
from the package fuzzyjoin. Here you can join by multiple columns and specify the match_fun
per column you want like "START" > "start_time" for example. You can replace the NAs with null. Here is a reproducible example:
library(tidyverse)
library(lubridate)
library(fuzzyjoin)
output <- df1 %>%
mutate(START = ymd_hms(START),
END = ymd_hms(END)) %>%
fuzzy_left_join(df2 %>% mutate(start_time = ymd_hms(start_time),
endtime = ymd_hms(endtime)),
by = c("id" = "id", "START" = "start_time", "END" = "endtime"),
match_fun = list(`==`, `<`, `>`)) %>%
mutate(code = ifelse(is.na(code), "null", code))
print(output, n = 41)
#> # A tibble: 40 × 10
#> id.x STREAM_1 STREAM_2 START END id.y code
#> <int> <chr> <chr> <dttm> <dttm> <int> <chr>
#> 1 401 NIVH-OFF IN 2022-08-16 02:00:00 2022-08-16 03:35:00 NA null
#> 2 401 NIVH-OFF IN 2022-08-16 03:35:00 2022-08-16 07:22:45 NA null
#> 3 401 NIVH-ON IN 2022-08-16 07:22:45 2022-08-16 07:31:45 NA null
#> 4 401 NIVH-DOCK IN 2022-08-16 07:31:45 2022-08-16 07:43:30 NA null
#> 5 401 NIVH-ON IN 2022-08-16 07:43:30 2022-08-16 07:45:00 NA null
#> 6 401 INVH-ON IN 2022-08-16 07:45:00 2022-08-16 07:49:00 NA null
#> 7 401 INVH-ON OUT 2022-08-16 07:49:00 2022-08-16 08:50:00 401 12be4
#> 8 401 INVH-ON OUT 2022-08-16 07:49:00 2022-08-16 08:50:00 401 12be4
#> 9 401 INVH-ON IN 2022-08-16 08:50:00 2022-08-16 08:56:00 NA null
#> 10 401 NIVH-ON IN 2022-08-16 08:56:00 2022-08-16 08:58:00 NA null
#> 11 401 INVH-ON IN 2022-08-16 08:58:00 2022-08-16 09:00:00 NA null
#> 12 401 NIVH-ON IN 2022-08-16 09:00:00 2022-08-16 09:24:00 NA null
#> 13 401 INVH-ON IN 2022-08-16 09:24:00 2022-08-16 09:25:00 NA null
#> 14 401 NIVH-ON IN 2022-08-16 09:25:00 2022-08-16 09:55:00 NA null
#> 15 401 NIVH-ON OUT 2022-08-16 09:55:00 2022-08-16 10:20:00 NA null
#> 16 401 NIVH-ON IN 2022-08-16 10:20:00 2022-08-16 11:30:00 NA null
#> 17 401 NIVH-ON OUT 2022-08-16 11:30:00 2022-08-16 12:00:00 NA null
#> 18 401 NIVH-ON IN 2022-08-16 12:00:00 2022-08-16 12:03:00 NA null
#> 19 401 INVH-ON IN 2022-08-16 12:03:00 2022-08-16 12:06:00 NA null
#> 20 401 NIVH-ON IN 2022-08-16 12:06:00 2022-08-16 12:08:00 NA null
#> 21 401 INVH-ON IN 2022-08-16 12:08:00 2022-08-16 12:11:00 NA null
#> 22 401 NIVH-ON IN 2022-08-16 12:11:00 2022-08-16 12:33:00 NA null
#> 23 401 INVH-ON IN 2022-08-16 12:33:00 2022-08-16 12:34:00 NA null
#> 24 401 NIVH-ON IN 2022-08-16 12:34:00 2022-08-16 13:25:00 NA null
#> 25 401 NIVH-ON IN 2022-08-16 13:25:00 2022-08-16 13:35:00 NA null
#> 26 401 NIVH-ON IN 2022-08-16 13:35:00 2022-08-16 14:14:00 NA null
#> 27 401 NIVH-ON IN 2022-08-16 14:14:00 2022-08-16 14:15:00 NA null
#> 28 401 INVH-ON IN 2022-08-16 14:15:00 2022-08-16 14:19:00 NA null
#> 29 401 NIVH-ON IN 2022-08-16 14:19:00 2022-08-16 15:04:00 NA null
#> 30 401 NIVH-ON IN 2022-08-16 15:04:00 2022-08-16 15:09:00 NA null
#> 31 401 NIVH-ON IN 2022-08-16 15:09:00 2022-08-16 15:24:00 NA null
#> 32 401 NIVH-ON OUT 2022-08-16 15:24:00 2022-08-16 15:26:00 NA null
#> 33 401 INVH-ON OUT 2022-08-16 15:26:00 2022-08-16 16:04:00 401 1fb84
#> 34 401 INVH-ON OUT 2022-08-16 15:26:00 2022-08-16 16:04:00 401 1fb84
#> 35 401 INVH-ON OUT 2022-08-16 15:26:00 2022-08-16 16:04:00 401 1b1c4
#> 36 401 INVH-ON IN 2022-08-16 16:04:00 2022-08-16 16:07:00 NA null
#> 37 401 NIVH-ON IN 2022-08-16 16:07:00 2022-08-16 16:13:00 NA null
#> 38 401 INVH-ON IN 2022-08-16 16:13:00 2022-08-16 16:15:00 NA null
#> 39 401 NIVH-ON IN 2022-08-16 16:15:00 2022-08-16 16:20:15 NA null
#> 40 401 NIVH-DOCK IN 2022-08-16 16:20:15 2022-08-17 02:00:00 NA null
#> # … with 3 more variables: date <int>, start_time <dttm>, endtime <dttm>
Created on 2022-08-25 with reprex v2.0.2
Please note: Finally, you can tidy up the columns if you want.
Upvotes: 2