J.K.

Reputation: 391

A faster way than ddply to aggregate a variable by groups

Purpose

I have an event id (i.e., oid) and dyad-level observations associated with each event: an agent (i.e., aid) and a partner (i.e., pid). The events are sorted by the time at which each event occurs (i.e., o4.in). I also have a dummy variable indicating whether a pair of values in two columns appears in the previous event (i.e., j2.consecutive).

I used ddply to aggregate the dummy variable by groups, as shown in the example below (e.g., t2 creates aj1.consecutive and apj1.consecutive). However, with 4M observations the code takes too long, and RStudio often crashes while running it.

Is there a faster and more compact way to achieve the same task than ddply?
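For clarity, the quantity being aggregated is a leave-one-out group average: for each row, (sum(x) - x)/(n - 1) is the mean of the dummy over the other rows of the same group. A minimal toy sketch with hypothetical values:

x <- c(1, 0, 1, 1)              # dummy values within one group
(sum(x) - x) / (length(x) - 1)  # for each row, the average over the other rows
[1] 0.6666667 1.0000000 0.6666667 0.6666667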

Dataset

library(tidyverse)
library(data.table)

# pin the dplyr verbs so they are not masked later (e.g., by plyr)
rename <- dplyr::rename
select <- dplyr::select

set.seed(10001)
cases <- sample(1:5, 1000, replace = TRUE)

set.seed(10002)
agent <- sample(1:20, 1000, replace = TRUE)

set.seed(10003)
partner <- sample(1:20, 1000, replace = TRUE)

set.seed(123)
n <- 1000  # number of random datetimes needed
minDate <- as.POSIXct("1999-01-01")
maxDate <- as.POSIXct("2000-01-01")

epoch <- "1970-01-01"
timestamps <-
  as.POSIXct(pmax(runif(n, minDate, maxDate), runif(n, minDate, maxDate)),
             origin = epoch)

df <-
  data.frame(cases, agent, partner, timestamps) %>%
  rename(
    aid = agent,
    pid = partner,
    oid = cases,
    o4.in = timestamps
  ) %>%
  filter(aid != pid) %>%
  arrange(o4.in)

# for each row, the oid of the agent's and the partner's previous event
t <- setDT(df)[order(o4.in)]
t[, oid.lag.a := shift(oid), by = aid
][, oid.lag.p := shift(oid), by = pid]

# 1 if both previous events coincide, 0 otherwise (NAs from shift become 0)
t <-
  t[, j2.consecutive := fcoalesce(+(oid.lag.a == oid.lag.p), 0L)] %>%
  arrange(aid, o4.in)

Current Method

# aggregating the dummy variable by groups
library(plyr)  # needed for ddply()

t2 <-
  t %>%
  ungroup %>%
  # leave-one-out average of j2.consecutive within each (oid, aid) group
  ddply(c('oid', 'aid'), function(i){
    i %>%
      mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1))
  }, .progress = 'text') %>%
  arrange(oid, pid) %>%
  # the same leave-one-out average within each (oid, pid) group
  ddply(c('oid', 'pid'), function(i){
    i %>%
      mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1))
  }, .progress = 'text') %>%
  arrange(aid, o4.in)
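For reference, the same two computations can also be written as grouped mutate calls in dplyr, which avoids ddply's per-group function-call overhead. This is only a sketch (t2.dplyr is an illustrative name), assuming t is the table built above:

t2.dplyr <-
  t %>%
  group_by(oid, aid) %>%
  mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1)) %>%
  group_by(oid, pid) %>%
  mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1)) %>%
  ungroup() %>%
  arrange(aid, o4.in)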

Upvotes: 1

Views: 86

Answers (1)

akrun

Reputation: 887531

A more efficient data.table option is := together with the set* functions. According to ?':=':

set is a low-overhead loop-able version of :=. It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop).
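As a small illustration of that loop-able use, on a hypothetical table unrelated to the data above:

dt <- data.table(a = 1:5, b = 6:10)
for (j in c("a", "b")) {
  set(dt, j = j, value = dt[[j]] * 2L)  # update one column per iteration, by reference
}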

Also, based on ?setorder

setorder (and setorderv) reorders the rows of a data.table based on the columns (and column order) provided. It reorders the table by reference and is therefore very memory efficient.
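For instance, on a small hypothetical table:

dt <- data.table(x = c(3, 1, 2), y = c("c", "a", "b"))
setorder(dt, x)  # dt itself is reordered; no sorted copy is created
dt
   x y
1: 1 a
2: 2 b
3: 3 c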

The code below assigns (:=) by reference, grouped by 'oid'/'aid' or 'oid'/'pid', and orders with setorder, which makes it more efficient. A copy of the original object is made first so that the original is not modified by the assignment by reference.

library(data.table)
t3 <- copy(t)  # copy so the original t is left untouched

# leave-one-out average of j2.consecutive by (oid, aid), assigned by reference
t3[, aj1.consecutive := (sum(j2.consecutive) - j2.consecutive)/(.N - 1), .(oid, aid)]
setorder(t3, oid, pid)

# the same by (oid, pid)
t3[, apj1.consecutive := (sum(j2.consecutive) - j2.consecutive)/(.N - 1), .(oid, pid)]
setorder(t3, aid, o4.in)
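To gauge the speedup on the full 4M-row data, the block above can be wrapped in system.time() (a sketch; timings will depend on the machine and the real data):

system.time({
  t3 <- copy(t)
  t3[, aj1.consecutive := (sum(j2.consecutive) - j2.consecutive)/(.N - 1), .(oid, aid)]
  setorder(t3, oid, pid)
  t3[, apj1.consecutive := (sum(j2.consecutive) - j2.consecutive)/(.N - 1), .(oid, pid)]
  setorder(t3, aid, o4.in)
})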

Checking with the OP's output:

all.equal(t2, t3, check.attributes = FALSE)
[1] TRUE

Upvotes: 2
