M_Stringer

Reputation: 1

Count instance of datetime overlap across all rows in R dataframe

Hoping someone can assist me here. I've tried searching but nothing seems to match what I'm trying to do.

I'm trying to calculate, for each row in my data frame, the number of times that row's datetime falls within the time range of any row (its own included).

I have a data frame which contains 3 datetime columns, all POSIXt, in the format dd/mm/yyyy HH:MM.

I'd like my calculation to be in a new column called "duplicates"

| Row | Start_time       | Start_time_beg   | Start_time_end   |
|-----|------------------|------------------|------------------|
| 1   | 01/01/2017 03:00 | 01/01/2017 01:30 | 01/01/2017 04:30 |
| 2   | 01/01/2017 04:00 | 01/01/2017 02:30 | 01/01/2017 05:30 |
| 3   | 01/01/2017 04:10 | 01/01/2017 02:40 | 01/01/2017 05:40 |
| 4   | 01/01/2017 05:00 | 01/01/2017 03:30 | 01/01/2017 06:30 |
| 5   | 01/01/2017 08:00 | 01/01/2017 06:30 | 01/01/2017 09:30 |

So in the above example data I'd like to count, for each row, every instance where its Start_time falls in the range Start_time_beg : Start_time_end of rows 1:n.

The results for this data would be:

| Row | Duplicates | Overlaps with       |
|-----|------------|---------------------|
| 1   | 3          | rows 1, 2, 3        |
| 2   | 4          | rows 1, 2, 3, 4     |
| 3   | 4          | rows 1, 2, 3, 4     |
| 4   | 3          | rows 2, 3, 4        |
| 5   | 1          | itself only (row 5) |

My thought was to create a seq array for each Start_time_beg:Start_time_end, then build a data frame with the count of each Start_time from that. I could then join this back onto the original df.

So far I have:

x <- d1$Start_Time
y <- d1$Start_Time_Beg
z <- d1$Start_Time_End


t <- seq(y[1],z[1],"mins")
t2<- seq(y[2],z[2],"mins")

tn <- c(t,t2)

p<-count(tn,'tn')

This gives me the desired df (p) from the time range array. The problem is that I have tried to write a loop to generate t for 1:nrows (the rows run into the thousands, so they can't be typed manually), but I'm having no luck:

for (i in 1:length(d1$Start_Time))
{seq(d$Start_Time_Beg[c(1+i)],d$Start_Time_End[c(1+i)],"mins")}

This just gives me an int of length = nrows, not the array of datetimes I was after.

I'm not even sure this is the right way to go about it. I've had a bash at using dplyr, but no luck.

Any help much appreciated. Apologies, my tables don't seem to have aligned properly.

Thanks in advance for any help

Upvotes: 0

Views: 154

Answers (1)

Uwe

Reputation: 42544

With data.table this is a one-liner:

library(data.table)   # CRAN version 1.10.4 used
setDT(DT)
DT[DT, on = .(Start_time >= Start_time_beg, Start_time <= Start_time_end), 
   Duplicates := .N, by = .EACHI][]
     Row          Start_time      Start_time_beg      Start_time_end Duplicates
   <int>              <POSc>              <POSc>              <POSc>      <int>
1:     1 2017-01-01 03:00:00 2017-01-01 01:30:00 2017-01-01 04:30:00          4
2:     2 2017-01-01 04:00:00 2017-01-01 02:30:00 2017-01-01 05:30:00          3
3:     3 2017-01-01 04:10:00 2017-01-01 02:40:00 2017-01-01 05:40:00          3
4:     4 2017-01-01 05:00:00 2017-01-01 03:30:00 2017-01-01 06:30:00          3
5:     5 2017-01-01 08:00:00 2017-01-01 06:30:00 2017-01-01 09:30:00          1

Explanation

After coercion to class data.table, DT is joined with itself using a non-equi join. The multiple matching rows are immediately counted (.N) for each set of join parameters (grouping by each row of i). Finally, the count is assigned to a new column of DT (update on join).
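Cross-check

The printed Duplicates column above (4, 3, 3, 3, 1) is not the vector the question asks for (3, 4, 4, 3, 1). As far as I can tell, this is because the update on join assigns .N to every matching row, so a row that matches several ranges keeps only the last count written to it. As a cross-check, here is a minimal base R sketch that counts directly against the question's definition: for each row, how many ranges contain its Start_time. It assumes the column names from the Data section and parses the timestamps as UTC; `parse_dt` and `d1` are names chosen for illustration only.

```r
# Helper (hypothetical name): parse dd/mm/yyyy HH:MM strings; UTC is an assumption
parse_dt <- function(x) as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Sample data copied from the question
d1 <- data.frame(
  Row = 1:5,
  Start_time = parse_dt(c("01/01/2017 03:00", "01/01/2017 04:00",
                          "01/01/2017 04:10", "01/01/2017 05:00",
                          "01/01/2017 08:00")),
  Start_time_beg = parse_dt(c("01/01/2017 01:30", "01/01/2017 02:30",
                              "01/01/2017 02:40", "01/01/2017 03:30",
                              "01/01/2017 06:30")),
  Start_time_end = parse_dt(c("01/01/2017 04:30", "01/01/2017 05:30",
                              "01/01/2017 05:40", "01/01/2017 06:30",
                              "01/01/2017 09:30"))
)

# For each row i, count the rows whose [beg, end] range contains Start_time[i]
d1$Duplicates <- vapply(
  seq_len(nrow(d1)),
  function(i) {
    sum(d1$Start_time_beg <= d1$Start_time[i] &
          d1$Start_time[i] <= d1$Start_time_end)
  },
  integer(1)
)

d1$Duplicates
# [1] 3 4 4 3 1
```

This compares every row against every other row, so it is quadratic in the number of rows. That is fine for a few thousand rows; for much larger data a non-equi join is likely the better tool.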

Data

library(data.table)
options(datatable.print.class = TRUE)

DT <- fread(
  "|Row  | Start_time     | Start_time_beg | Start_time_end|
  |1    | 01/01/2017 03:00 | 01/01/2017 01:30 | 01/01/2017 04:30|
  |2    | 01/01/2017 04:00 | 01/01/2017 02:30 | 01/01/2017 05:30|
  |3    | 01/01/2017 04:10 | 01/01/2017 02:40 | 01/01/2017 05:40|
  |4    | 01/01/2017 05:00 | 01/01/2017 03:30 | 01/01/2017 06:30|
  |5    | 01/01/2017 08:00 | 01/01/2017 06:30 | 01/01/2017 09:30|",
  sep = "|", drop = c(1, 6))
cols <- stringr::str_subset(names(DT), "time")
DT[, (cols) := lapply(.SD, lubridate::dmy_hm), .SDcols = cols]
DT
     Row          Start_time      Start_time_beg      Start_time_end
   <int>              <POSc>              <POSc>              <POSc>
1:     1 2017-01-01 03:00:00 2017-01-01 01:30:00 2017-01-01 04:30:00
2:     2 2017-01-01 04:00:00 2017-01-01 02:30:00 2017-01-01 05:30:00
3:     3 2017-01-01 04:10:00 2017-01-01 02:40:00 2017-01-01 05:40:00
4:     4 2017-01-01 05:00:00 2017-01-01 03:30:00 2017-01-01 06:30:00
5:     5 2017-01-01 08:00:00 2017-01-01 06:30:00 2017-01-01 09:30:00

Upvotes: 1
