Reputation: 381
I would like to count the number of past occurrences of each pair of two IDs within a time frame in R.
Specifically, in the below example dataset, I have 10 people who work in a same company. Two workers work as a pair, and they can work together zero to multiple times on a given date in different rooms. I would like to count how many times they worked together in the room previously within 6 months time frame of each observed date (i.e., when the data is ordered from earliest date to the latest date, how many times a pair appeared in the room within past 6 months before the current date?)
Also, I used two methods to calculate it after searching, and found that method 1 (ddply) generates the desired output, but method 2 (Data.Table) generates incorrect output but similar to the desired one. Because method 2 takes much less time with large observations in my original dataset, I would like to also know how to correct my method 2 code.
Comparison code at the end is provided so that you can easily compare two outputs.
I appreciate it for your help.
library(tidyverse)
library(tibble)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10000)
room <- sample(1:5, 1000, replace=T)
set.seed(10001)
agent <- sample(1:10, 1000, replace=T)
set.seed(10000)
partner <- sample(1:10, 1000, replace=T)
set.seed(10000)
date <- sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 1000, replace=T)
df <-
data.frame(room, agent, partner, date) %>%
rowid_to_column %>%
rename(
aid = agent,
pid = partner,
o3.room = room,
o4.in = date,
oid = rowid
) %>%
filter(aid != pid) %>%
arrange(o3.room, aid, pid, o4.in) %>%
mutate(cases = 1) %>% # for cumsum in t1
mutate(o4.in_6mos = o4.in %m-% months(6)) # for t2
t1 <-
df %>%
ddply(c('aid', 'pid', 'o3.room'), function(i){
i %>%
arrange(aid, pid, o3.room, o4.in) %>%
filter(o4.in > o4.in %m-% months(6)) %>%
mutate(j1.room = cumsum(cases)-1)
}, .progress = 'text') %>%
select(oid, o4.in, o3.room, aid, pid, j1.room) %>%
arrange(o3.room, aid, pid, o4.in)
Where I modified the answers from a post in Stock Overflow.
t2 <-
df %>%
select(oid, o3.room, o4.in) %>%
cbind(
setDT(df)[df, .(j1.room = .N),
on = .(o3.room, aid, pid, o4.in < o4.in, o4.in > o4.in_6mos),
by = .EACHI] %>%
select(aid, pid, j1.room)
) %>%
arrange(o3.room, aid, pid, o4.in)
t_compare <-
t1 %>%
select(-o4.in) %>%
rename(j1.room1 = j1.room) %>%
left_join(
t2 %>% rename(j1.room2 = j1.room),
by = c('o3.room', 'aid', 'pid', 'oid')
) %>%
arrange(o3.room, aid, pid, o4.in) %>%
mutate(j3.room = ifelse(j1.room1 != j1.room2, 'non-match', '-')) %>%
mutate(j2.room = ifelse(j1.room1 != j1.room2, '0', '1'))
Upvotes: 0
Views: 51
Reputation: 8506
To do the same steps with data.table
, you could do, for example:
# used a different seed for `partner` to generate `df`
library(data.table)
library(lubridate)
ks <- c('aid', 'pid', 'o3.room')
DT <- data.table(df, key=ks)[
o4.in > o4.in %m-% months(6)][, j1.room:=cumsum(cases)-1, by=ks][
,.(oid, o4.in, o3.room, aid, pid, j1.room)]
setorder(DT, o3.room, aid, pid, o4.in)[]
# check if you get the same result:
identical(DT, as.data.table(t1))
Upvotes: 1