DanG
DanG

Reputation: 741

Join two datasets with nearst start time with interval fuzzy join

I am trying to join two large datasets in R with 'fuzzyjoin:interval_inner_join'. my goal is to join these to table base on nearest start and end time.

# first dataset

viewing <- data.frame(stringsAsFactors=FALSE,
                 id = c("100-16", "100-16", "100-16", "100-16", "100-16",
                        "100-16", "100-16", "100-16", "100-16", "100-16"),
      start_utc_day = c("2019-05-01", "2019-05-01", "2019-05-01", "2019-05-01",
                        "2019-05-01", "2019-05-01", "2019-05-01", "2019-05-01",
                        "2019-05-01", "2019-05-01"),
   start = c("7:18:45", "7:27:30", "7:59:30", "8:37:30", "8:41:15",
                        "8:47:15", "8:52:45", "8:55:30", "8:57:45", "9:05:00"),
     end = c("7:26:15", "7:59:15", "8:33:45", "8:40:30", "8:43:15",
                        "8:50:15", "8:55:15", "8:57:00", "9:00:00", "9:07:00")
)

# second dataset
location <- data.frame(stringsAsFactors=FALSE,
                 id = c("100-16", "100-16", "100-16", "100-16", "100-16",
                        "100-16", "100-16", "100-16", "100-16", "100-16"),
               code = c("IN", "IN", "IN", "IN", "IN", "IN", "IN", "IN", "IN",
                        "IN"),
            utc_day = c("2019-05-01", "2019-05-01", "2019-05-01", "2019-05-01",
                        "2019-05-01", "2019-05-01", "2019-05-01", "2019-05-01",
                        "2019-05-01", "2019-05-01"),
   start = c("7:13:30", "7:17:00", "7:22:00", "7:41:00", "8:14:15",
                        "8:33:45", "8:43:00", "9:08:45", "9:21:15", "9:32:00"),
     end = c("7:15:30", "7:20:30", "7:31:00", "7:43:00", "8:15:45",
                        "8:35:15", "8:45:30", "9:12:15", "9:23:00", "9:35:15")
)

Referring to This link I tried to use interval_semi_join:

library(fuzzyjoin)
interval_semi_join(viewing,location, minoverlap = 3)

but got this error

Joining by: c("id", "start", "end")

Error in index_match_fun(d1, d2) :
interval_join must join on exactly two columns (start and end)

Upvotes: 2

Views: 763

Answers (1)

Parfait
Parfait

Reputation: 107567

Consider creating the appropriate date/time fields then running the fuzzy join. Character columns cannot be used for numeric interval comparison or matching.

viewing <- within(viewing, {
   end_dt_time <- as.POSIXct(paste(start_utc_day, end), format="%Y-%m-%d %H:%M:%S") 
   start_dt_time <- as.POSIXct(paste(start_utc_day, start), format="%Y-%m-%d %H:%M:%S")
})

location <- within(location, {
   end_dt_time <- as.POSIXct(paste(utc_day, end), format="%Y-%m-%d %H:%M:%S") 
   start_dt_time <- as.POSIXct(paste(utc_day, start), format="%Y-%m-%d %H:%M:%S")
})

interval_semi_join(viewing, location, by=c("start_dt_time", "end_dt_time"), minoverlap=3)

Upvotes: 1

Related Questions