Reputation: 1
I have a couple of large daily datasets that I need to summarise and bind by month in R. Because the datasets are so large, I'd like to do the summarising in parallel so that it is faster. I've been successful in summarising and binding them with a regular loop, but the summarising portion takes all night.
The dataset looks something like this:
number id date
     1  1 0102
     1  1 0102
     2  1 0102
     2  2 0102
and I want:
number id date count
     1  1 0102     2
     2  1 0102     1
     2  2 0102     1
library(tidyverse)

# Count rows per (number, date, id), attach the total calls per
# (number, date), then keep the row with the highest count in each
# (number, date) group
collapse_cdr <- function(data) {
  data %>%
    group_by(number, date, id) %>%
    summarise(count = n()) %>%
    mutate(total.calls = sum(count)) %>%
    slice(which.max(count))
}
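For a quick reproducible test of collapse_cdr, the sample data above can be built directly (columns are character here to match the col_types = cols(.default = "c") used below):

library(tibble)

# Toy version of one day's data, matching the sample above
dta <- tibble(
  number = c("1", "1", "2", "2"),
  id     = c("1", "1", "1", "2"),
  date   = "0102"
)

collapse_cdr(dta)
# Note: slice(which.max(count)) keeps a single id per (number, date)
# group, so for number 2 only the first of the tied ids survives;
# dropping the slice() step would keep every id, as in the desired
# output above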
library(doParallel)

wd <- "working directory"

cl <- makeCluster(8)
registerDoParallel(cl)

# Read each day's file, tag it with its day code as the date,
# collapse it, and row-bind the per-day summaries
month <- foreach(i = day_code, .combine = rbind,
                 .packages = c("tidyverse", "readr")) %dopar% {
  filename <- paste0(wd, "/", i, ".csv")
  dta <- read_csv(filename, col_types = cols(.default = "c"))
  dta$date <- i
  dta <- collapse_cdr(data = dta)
  data.frame(dta)
}
Right now I'm getting the warning: closing unused connection 62 (<-localhost:11439)
Thank you!
Upvotes: 0
Views: 266
Reputation: 24722
I would suggest an approach using data.table; fread and grouped .N aggregation are generally much faster on large files.
library(data.table)
library(doParallel)
library(foreach)
# Same logic as the dplyr version: count rows per (number, date, id),
# attach the per-(number, date) total by reference, then keep the
# highest-count row in each (number, date) group
collapse_cdr <- function(d) {
  d[, .(count = .N), by = .(number, date, id)][
    , total.calls := sum(count), by = .(number, date)][
    , .SD[which.max(count)], by = .(number, date)]
}
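As a sanity check, running it on the question's sample data should give one row per (number, date) pair, something like:

library(data.table)

# Sample data from the question
d <- data.table(
  number = c(1L, 1L, 2L, 2L),
  id     = c(1L, 1L, 1L, 2L),
  date   = "0102"
)

collapse_cdr(d)
#    number date id count total.calls
# 1:      1 0102  1     2           2
# 2:      2 0102  1     1           2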
wd <- "working directory"

cl <- makeCluster(8)
registerDoParallel(cl)

# Each worker reads one day's file, tags it with its day code, and
# collapses it; rbindlist() stacks the per-day results.
# data.table must be loaded on the workers via .packages, otherwise
# fread() and the [.data.table syntax won't be found there.
month <- rbindlist(
  foreach(i = day_code, .packages = "data.table") %dopar% {
    collapse_cdr(fread(paste0(wd, "/", i, ".csv"))[, date := i])
  }
)
stopCluster(cl)
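If you want to test the whole pipeline end to end without your real files, a couple of toy daily CSVs can be generated first (wd and the day codes here are stand-ins, not your actual paths):

library(data.table)

# Hypothetical setup: write two small daily files so the loop above
# has something to read; "0101" and "0102" stand in for real day codes
wd <- tempdir()
day_code <- c("0101", "0102")
for (i in day_code) {
  fwrite(data.table(number = c(1, 1, 2, 2), id = c(1, 1, 1, 2)),
         paste0(wd, "/", i, ".csv"))
}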
Upvotes: 2