Reputation: 391
I have event id (i.e., oid
) and dyad-level observations associated with each event: agent (i.e., aid
), partner (i.e., pid
). The events are sorted by time when the event occurs (i.e., o4.in
). Also, I have a variable to indicate whether a pair of values in two columns appear in the previous event (i.e., j2.consecutive
).
I used ddply
to aggregate the dummy variable by groups, as specified in the below example (e.g., t2
to create aj1.consecutive
and apj1.consecutive
). However, with 4m observations, the code takes too much time and Rstudio is terminated often while running the code.
Is there a faster & compact way to achieve the same task other than applying ddply
?
library(tidyverse)
library(tibble)
library(data.table)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10001)
cases <- sample(1:5, 1000, replace=T)
set.seed(10002)
agent <- sample(1:20, 1000, replace=T)
set.seed(10003)
partner <- sample(1:20, 1000, replace=T)
set.seed(123)
n <- 1000 # no of random datetimes needed
minDate <- as.POSIXct("1999/01/01")
maxDate <- as.POSIXct("2000-01-01")
epoch <- "1970-01-01"
timestamps <-
as.POSIXct(pmax(runif(n, minDate, maxDate), runif(n, minDate, maxDate)), origin = epoch)
df <-
data.frame(cases, agent, partner, timestamps) %>%
rename(
aid = agent,
pid = partner,
oid = cases,
o4.in = timestamps
) %>%
filter(aid != pid) %>%
arrange(o4.in)
t <- setDT(df)[order(o4.in)]
t[, oid.lag.a := shift(oid), by = aid
][, oid.lag.p := shift(oid), by = pid]
t <-
t[, j2.consecutive := fcoalesce(+(oid.lag.a == oid.lag.p), 0L)] %>%
arrange(aid, o4.in)
# aggregating the dummy variable by groups
t2 <-
t %>%
ungroup %>%
ddply(c('oid', 'aid'), function(i){
i %>%
mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1))
} , .progress = 'text') %>%
arrange(oid, pid) %>%
ddply(c('oid', 'pid'), function(i){
i %>%
mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1))
} , .progress = 'text') %>%
arrange(aid, o4.in)
Upvotes: 1
Views: 86
Reputation: 887531
A more efficient data.table
option would be with :=
and set
functions. According to ?':='
set is a low-overhead loop-able version of :=. It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop).
Also, based on ?setorder
setorder (and setorderv) reorders the rows of a data.table based on the columns (and column order) provided. It reorders the table by reference and is therefore very memory efficient.
Below code assigns (:=
) by reference grouped by 'oid', 'aid' or 'oid', 'pid' and does the order with setorder
thus making it more efficient. The copy
was made on the original object so that it won't get changed while doing the assignment
library(data.table)
t3 <- copy(t)
t3[, aj1.consecutive := (sum(j2.consecutive) -
j2.consecutive)/(.N-1), .(oid, aid)]
setorder(t3, oid, pid)
t3[, apj1.consecutive := (sum(j2.consecutive) -
j2.consecutive)/(.N-1), .(oid, pid)]
setorder(t3, aid, o4.in)
-checking with OP's output
all.equal(t2, t3, check.attributes = FALSE)
[1] TRUE
Upvotes: 2