How to process and combine data.frames in a list faster

Question

I have run into very slow data processing when reading multiple data.frames and appending their rows. I use a combination of lapply and dplyr for the processing. However, the process becomes very slow, as I have 20,000 rows in each data frame and 100 files in the directory.

Currently this is a huge bottleneck for me: even after the lapply step finishes, I don't have enough memory for the bind_rows step.

Here is my data processing method.

First, make a list of files:

files <- list.files("file_directory",pattern = "w.*.csv",recursive=T,full.names = TRUE)

Then process this list of files:

library(tidyr)
library(dplyr)

data <- lapply(files, function(x) {
  read.table(file = x, sep = ',', header = TRUE, fill = FALSE, skip = 0,
             stringsAsFactors = FALSE, row.names = NULL) %>%
    select(A, B, C) %>%
    # combine B and C into BC; remove = FALSE keeps C for the steps below
    unite(BC, B, C, sep = '_', remove = FALSE) %>%
    mutate(D = C * A) %>%
    group_by(BC) %>%
    mutate(KK = median(C, na.rm = TRUE)) %>%
    select(BC, KK, D)
})

data <- bind_rows(data)

I get an error that says:

“Error: cannot allocate vector of size ... Mb” ...

The size depends on how much RAM is left. I have 8 GB of RAM, but it still seems to struggle ;(

I also tried do.call, but nothing changed. Which function or approach would help with this issue? I use R version 3.4.2 and dplyr 0.7.4.

talat · Accepted Answer

I can't test this answer since there's no reproducible data, but I guess it could be something like the following, using data.table:

library(data.table)

data <- setNames(lapply(files, function(x) {
  fread(x, select = c("A", "B", "C"))
}), basename(files))

data <- rbindlist(data, use.names = TRUE, fill = TRUE, idcol = "file_id")
data[, BC := paste(B, C, sep = "_")]
data[, D := C * A]
data[, KK := median(C, na.rm = TRUE), by = .(BC, file_id)]
data[, setdiff(names(data), c("BC", "KK", "D")) := NULL]
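
Since the question has no reproducible data, here is a minimal sketch that runs the same steps on made-up files. The toy CSVs, their values, the extra column Z and the temporary directory are all assumptions for illustration; only the columns A, B, C and the data.table calls come from the code above. C is deliberately non-integer here so that median() returns the same type in every group.

```r
library(data.table)

# Hypothetical toy input: three small CSV files in a fresh temporary directory.
# Columns A, B, C mirror the question; Z is an extra column that is never read.
dir <- tempfile("toy_csv_")
dir.create(dir)
for (i in 1:3) {
  toy <- data.table(A = 1:4,
                    B = c("x", "x", "y", "y"),
                    C = c(1.5, 1.5, 2.5, 2.5) + i,
                    Z = i)
  fwrite(toy, file.path(dir, sprintf("w%d.csv", i)))
}

files <- list.files(dir, pattern = "^w.*\\.csv$", full.names = TRUE)

# Read only the needed columns from each file and name each list element
# after its file, so rbindlist can record where every row came from.
data <- setNames(lapply(files, fread, select = c("A", "B", "C")),
                 basename(files))

# Bind once; idcol stores the list names in a new column "file_id".
data <- rbindlist(data, use.names = TRUE, fill = TRUE, idcol = "file_id")

# Add the derived columns by reference, without copying the whole table.
data[, BC := paste(B, C, sep = "_")]
data[, D := C * A]
data[, KK := median(C, na.rm = TRUE), by = .(BC, file_id)]

# Drop everything except the three result columns.
data[, setdiff(names(data), c("BC", "KK", "D")) := NULL]

print(head(data))
```

The point of this layout is that fread's select argument avoids loading unused columns, and := modifies the table in place, so no intermediate copies of the combined data are created. Whether that is enough for 100 files on 8 GB of RAM depends on the real data, which I can't check.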
