rteam22

Reputation: 1

How do I summarise and bind large data sets in parallel using tidyverse in R?

I have a couple of large daily datasets that I need to summarise and bind by month in R. Because the datasets are so large, I'd like to do the summarising in parallel so that it is faster. I've been able to summarise and bind them with a regular loop, but the summarising portion takes all night.

The dataset looks something like this:

number  id  date
1       1   0102
1       1   0102
2       1   0102
2       2   0102

and I want

number  id  date  count
1       1   0102  2
2       1   0102  1
2       2   0102  1
library(tidyverse)
library(doParallel)
library(foreach)

# Summarise one day's data: count rows per number/date/id,
# then keep the row with the largest count within each number/date group
collapse_cdr <- function(data) {
  data %>%
    group_by(number, date, id) %>%
    summarise(count = n()) %>%
    mutate(total.calls = sum(count)) %>%
    slice(which.max(count))
}
wd<-("working directory")

cl <- makeCluster(8)
registerDoParallel(cl)
month = foreach(i=day_code, .combine=rbind, .packages=c("tidyverse","readr")) %dopar%
 { filename<-paste0(wd,"/", i, ".csv")
    dta<-read_csv(filename, col_types = cols(.default = "c"))
    dta$date <- i
    dta<-collapse_cdr(data=dta)
    data.frame(dta)
  }

Right now I'm getting the warning "closing unused connection 62 (<-localhost:11439)".

Thank you!

Upvotes: 0

Views: 266

Answers (1)

langtang

Reputation: 24722

I would suggest an approach using data.table:

library(data.table)
library(doParallel)
library(foreach)

# Function to collapse the data
collapse_cdr <- function(d) {
  d[, .(count=.N), .(number,date,id)][
    ,total.calls:=sum(count), .(number,date)][
      , .SD[which.max(count)], .(number,date)]
}

wd<-("working directory")

cl <- makeCluster(8)
registerDoParallel(cl)
month = rbindlist(
  foreach(i=day_code) %dopar% {
    collapse_cdr(fread(paste0(wd,"/", i, ".csv"))[, date:=i])
  }
)
stopCluster(cl)
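
If you want to stay inside the tidyverse, the same per-day pattern can be parallelised with furrr. A minimal sketch, assuming the day_code vector, the wd path, and the collapse_cdr() function from your question are already defined:

library(tidyverse)
library(future)
library(furrr)

plan(multisession, workers = 8)  # start 8 background R sessions

# Read, tag, and summarise each day's file on a worker,
# then row-bind the per-day results into one data frame
month <- future_map_dfr(day_code, function(i) {
  read_csv(paste0(wd, "/", i, ".csv"), col_types = cols(.default = "c")) %>%
    mutate(date = i) %>%
    collapse_cdr()
})

plan(sequential)  # release the workers when done

Either way, make sure the workers are shut down when you are finished (stopCluster() for doParallel, plan(sequential) for future), otherwise you can see "closing unused connection" warnings when the unused sockets are later garbage collected.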

Upvotes: 2
