Marco Badici
Marco Badici

Reputation: 85

Number of overlaping datetime inside same table (R)

I have a table of about 50 000 rows, with four columns.

ID     Arrival             Departure             Gender

1   10/04/2015 23:14    11/04/2015 00:21           F
1   11/04/2015 07:59    11/04/2015 08:08           F
3   10/04/2017 21:53    30/03/2017 23:37           M
3   31/03/2017 07:09    31/03/2017 07:57           M
3   01/04/2017 01:32    01/04/2017 01:35           M
3   01/04/2017 13:09    01/04/2017 14:23           M
6   10/04/2015 21:31    10/04/2015 23:17           F
6   10/04/2015 23:48    11/04/2015 00:05           F
6   01/04/2016 21:45    01/04/2016 22:48           F
6   02/04/2016 04:54    02/04/2016 07:38           F
6   04/04/2016 18:41    04/04/2016 22:48           F
10  10/04/2015 22:39    11/04/2015 00:42           M
10  13/04/2015 02:57    13/04/2015 03:07           M
10  31/03/2016 22:29    01/04/2016 08:39           M
10  01/04/2016 18:49    01/04/2016 19:44           M
10  01/04/2016 22:28    02/04/2016 00:31           M
10  05/04/2017 09:27    05/04/2017 09:28           M 
10  06/04/2017 15:12    06/04/2017 15:43           M

This is a very small representation of the table. What I want to find out is, at the same time as each entry, how many others were present and then separate them by gender. So, say for example that at the time as the first presence of person with ID 1, person with ID 6 was present and person with ID 10 was present twice in the same interval. That would mean that at the same time, 2 other overlaps occurred. This also means that person with ID 1 has overlapped with 1 Male and 1 Female.

So its result should look like:

ID           Arrival            Departure         Males encountered        Females encountered
1       10/04/2015 23:14    11/04/2015 00:21             1                          1

How would I be able to calculate this? I have tried to work with foverlaps and have managed to solve this with Excel, but I would want to do it in R.

Upvotes: 4

Views: 161

Answers (2)

PavoDive
PavoDive

Reputation: 6496

Here is a data.table solution using foverlaps.

First, notice that there's an error in your data:

ID           Arrival           Departure      Gender
3   10/04/2017 21:53    30/03/2017 23:37           M

The user arrived almost one month after he actually left. I needed to get rid of that data in order for foverlaps to run.

library(data.table)

dt <- data.table(df)
dt <- dt[Departure > Arrival, ]  # filter wrong cases

setkey(dt, "Arrival", "Departure")  # prepare for foverlaps
dt2 <- copy(dt)  # use a different dt, inherits the key

run foverlaps and then

  • filter (leave only) the cases where arrival of second person is before than ID and same user-cases.
  • Add a variable where we count the male simultaneous guests and
  • a variable where we count the female simultaneous guests, all grouped by ID and arrival

.

simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
                                       .(malesEncountered = sum(i.Gender == "M"),
                                         femalesEncountered = sum(i.Gender == "F")), 
                                       by = .(ID, Arrival)]

Join the findings of the previous command with our original table on ID and arrival

result <- simultaneous[dt, on = .(ID, Arrival)]

<EDIT>: Convert to zero the NAs in malesEncountered and femalesEncountered: </EDIT>

result[is.na(malesEncountered), malesEncountered := 0][
                 is.na(femalesEncountered), femalesEncountered := o]

set the column order to something nicer

setcolorder(result, c(1, 2, 5, 6, 3, 4))[]

Upvotes: 1

Dan Olner
Dan Olner

Reputation: 11

Here's one possibility. This uses lubridate's interval and the int_overlaps function that finds date overlaps. That has a drawback though: Interval doesn't work with dplyr. So this version is just doing all the work manually in a for loop.

It starts by making a 1000 row random dataset that matches yours: each person arrives in a two year period and departs one or two days later.

It's taking about 24 seconds for 1000 to run so you can expect it to take a while for 50K! The for loop outputs the row number so you can see where it is though.

Any questions about the code, lemme know.

There must be a faster vectorised way but interval didn't seem to play nice with apply either. Someone else might have something quicker...

Final output looks like this

library(tidyverse)
library(lubridate)

#Sample data:
#(Date sampling code: https://stackoverflow.com/questions/21502332/generating-random-dates)
#Random dates between 2017 and 2019
x <- data.frame(
  ID = c(1:1000),
  Arrival = sample(seq(as.Date('2017/01/01'), as.Date('2019/01/01'), by="day"), 1000, replace = T),
  Gender = ifelse(rbinom(1000,1,0.5),'Male','Female')#Random Male female 50% probabiliity
)

#Make departure one or two days after arrival
x$Departure = x$Arrival + sample(1:2,1000, replace=T)


#Lubridate has a function for checking whether date intervals overlap
#https://lubridate.tidyverse.org/reference/interval.html

#So first, let's make the arrival and departure dates into intervals
x$interval <- interval(x$Arrival,x$Departure)


#Then for every person / row
#We want to know if their interval overlaps with the rest

#At the moment, dplyr doesn't play nice with interval
#https://github.com/tidyverse/dplyr/issues/3206

#So let's go through each row and do this manually
#Keep each person's result in list initially
gendercounts <- list()

#Check timing
t <- proc.time()

#Go through every row manually (sigh!
for(i in 1:nrow(x)){

  print(paste0("Row ",i))

  #exclude self (don't want to check date overlap with myself)
  overlapcheck <- x[x$ID != x$ID[i],]

  #Find out what dates this person overlaps with - can do all other intervals in one command
  overlapcheck$overlaps <- int_overlaps(x$interval[i],overlapcheck$interval)

  #Eyeball check that is finding the overlaps we want
  #Is this ID date overlapping? Tick
  #View(overlapcheck[overlapcheck$overlaps,])

  #Use dplyr to find out the number of overlaps for male and female
  #Keep only columns where the overlap is TRUE
  #Also drop the interval column first tho as dplyr doesn't like it... (not tidy!)
  gendercount <- overlapcheck %>% 
    select(-interval) %>% 
    filter(overlaps) %>% 
    group_by(Gender) %>%
    summarise(count = n()) %>% #Get count of observations for each overlap for each sex
    complete(Gender, fill = list(count = 0))#Need this to keep zero counts: summarise drops them otherwise


  #We want count for each gender in their own column, so make wide
  gendercount <- gendercount %>% 
    spread(key = Gender, value = count)

  #Store for turning into dataframe shortly
  gendercounts[[length(gendercounts)+1]] <- gendercount

}

#Dlyr command: turn list into dataframe
gendercounts <- bind_rows(gendercounts)

#End result. Drop interval column, order columns
final <- cbind(x,gendercounts) %>% 
  select(ID,Arrival,Departure,Gender,Male,Female)

#~24 seconds per thousand
proc.time()-t

Upvotes: 1

Related Questions