Counting the number of occurrences of current pair of two IDs within a specific past time length in R

Question

Purpose

I would like to count the number of past occurrences of each pair of two IDs within a time frame in R.

Specifically, in the below example dataset, I have 10 people who work in a same company. Two workers work as a pair, and they can work together zero to multiple times on a given date in different rooms. I would like to count how many times they worked together in the room previously within 6 months time frame of each observed date (i.e., when the data is ordered from earliest date to the latest date, how many times a pair appeared in the room within past 6 months before the current date?)

Current Progress

Also, I used two methods to calculate it after searching, and found that method 1 (ddply) generates the desired output, but method 2 (Data.Table) generates incorrect output but similar to the desired one. Because method 2 takes much less time with large observations in my original dataset, I would like to also know how to correct my method 2 code.

Comparison code at the end is provided so that you can easily compare two outputs.

I appreciate it for your help.

Dataset

  library(tidyverse)
  library(tibble)
  
  rename <- dplyr::rename
  select <- dplyr::select
  
  set.seed(10000)
  room <- sample(1:5, 1000, replace=T)
  
  set.seed(10001)
  agent <- sample(1:10, 1000, replace=T)
  
  set.seed(10000)
  partner <- sample(1:10, 1000, replace=T)
  
  set.seed(10000)
  date <- sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 1000, replace=T)

  df <-
    data.frame(room, agent, partner, date) %>% 
    rowid_to_column %>% 
    rename(
      aid = agent,
      pid = partner,
      o3.room = room,
      o4.in = date,
      oid = rowid
    ) %>% 
    filter(aid != pid) %>% 
    arrange(o3.room, aid, pid, o4.in) %>%
    mutate(cases = 1) %>%  # for cumsum in t1
    mutate(o4.in_6mos = o4.in %m-% months(6))  # for t2

Method1 - ddply

  t1 <- 
    df %>% 
    ddply(c('aid', 'pid', 'o3.room'), function(i){
      i %>% 
        arrange(aid, pid, o3.room, o4.in) %>% 
        filter(o4.in > o4.in %m-% months(6)) %>% 
        mutate(j1.room = cumsum(cases)-1)
    }, .progress = 'text') %>% 
    select(oid, o4.in, o3.room, aid, pid, j1.room) %>% 
    arrange(o3.room, aid, pid, o4.in)

Method2 - Data.Table

Where I modified the answers from a post in Stock Overflow.

  t2 <- 
    df %>% 
    select(oid, o3.room, o4.in) %>% 
    cbind(
      setDT(df)[df, .(j1.room = .N), 
                on = .(o3.room, aid, pid, o4.in < o4.in, o4.in > o4.in_6mos), 
                by = .EACHI] %>% 
        select(aid, pid, j1.room) 
    ) %>% 
    arrange(o3.room, aid, pid, o4.in)

Comparison

  t_compare <- 
    t1 %>% 
    select(-o4.in) %>% 
    rename(j1.room1 = j1.room) %>% 
    left_join(
      t2 %>% rename(j1.room2 = j1.room),
      by = c('o3.room', 'aid', 'pid', 'oid')
    )  %>% 
    arrange(o3.room, aid, pid, o4.in) %>% 
    mutate(j3.room = ifelse(j1.room1 != j1.room2, 'non-match', '-')) %>% 
    mutate(j2.room = ifelse(j1.room1 != j1.room2, '0', '1'))

user12728748 · Accepted Answer

To do the same steps with data.table, you could do, for example:


# used a different seed for `partner` to generate `df`

library(data.table)
library(lubridate)
ks <- c('aid', 'pid', 'o3.room')
DT <- data.table(df, key=ks)[
     o4.in > o4.in %m-% months(6)][, j1.room:=cumsum(cases)-1, by=ks][
         ,.(oid, o4.in, o3.room, aid, pid, j1.room)]
setorder(DT, o3.room, aid, pid, o4.in)[]

# check if you get the same result:
identical(DT, as.data.table(t1))