Reputation: 23
I've run into the problem of combining a large number of datasets in R (858, to be precise) and wrote a couple of lines of code to do this as memory-efficiently as possible. The code below works, but I'm wondering if there is a faster way of doing it. Memory use is stable, yet the process slows down considerably with every dataset that gets added. Past 350 or so, it becomes excruciatingly slow. Not impossible to run, but I'm hoping for some way to optimize this.
Each dataset consists of around 17,000 rows over 88 columns, with a rather large number of duplicates between datasets. It's Twitter data, so the status_id variable serves as a good indicator of duplicate rows. I'm using data.table to make the process more efficient. After each dataset is added, duplicates are cleaned out with unique(), the just-loaded dataset is removed, and gc() is called. I suspect the unique() call slows down as the combined dataset grows, so chunking this process might be an option here. But maybe there's something else that I'm not seeing?
library(data.table)

filenames <- list.files(pattern = "dataset_*")
full_data <- data.table()

for (i in 1:length(filenames)) {
  load(filenames[i])
  print(paste(i, "/", length(filenames), ":", filenames[i]))          # progress: current dataset / total
  dataset <- as.data.table(get(ls(pattern = "dataset_")))             # fetch the object just loaded from the .RData file
  full_data <- rbind(full_data, dataset)                              # add new dataset
  rm(list = ls(pattern = "dataset_"))                                 # remove loaded dataset from the environment
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")   # remove duplicate tweets in data
  gc(verbose = FALSE)                                                 # call gc to clear out RAM
}
Any input on the issue is welcome; I'm trying to get into the habit of making my R code as efficient as possible :)
Upvotes: 1
Views: 928
Reputation: 23
So, I've done some tinkering and come to a first solution. Batch-combining the datasets with rbindlist indeed proved to be a great efficiency boost.
I believe this mostly has to do with the pre-allocation of size that rbindlist does automatically. I struggled a bit to find an elegant way to pre-allocate the result that did not involve hand-naming all 88 columns and wildly guessing the final size (due to the removal of duplicates along the way), so I didn't benchmark the rbindlist solution against an rbind solution with a pre-allocated data.table.
So here's a solution for batching the files and combining them via rbindlist. Combining all 858 datasets onto an initial dataset of ~236,000 tweets from an initial search clocked in at 909.68 seconds as measured with system.time(), and produced a dataset of ~2.5 million rows.
library(data.table)

filenames <- list.files(pattern = "dataset_*")
full_data <- as.data.table(data_init)

for (i in seq(1, length(filenames), 13)) {                            # sequence into batches (here: 13)
  files <- filenames[i:(i + 12)]                                      # files belonging to the current batch
  for (j in 1:length(files)) {
    load(files[j])
    print(paste((i + j - 1), "/", length(filenames), ":", files[j]))  # progress: current dataset / total
  }
  full_data <- rbindlist(mget(c(ls(pattern = "dataset"), ls(pattern = "full_data"))))  # add the new batch
  print("- batch combined -")
  rm(list = ls(pattern = "dataset"))                                  # remove the loaded datasets
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")   # remove duplicate tweets in data
}
I've divided them into batches of 13, since that divides evenly into the 858 datasets. Some testing showed that batches of ~8 datasets might be more efficient. I'm not sure what an elegant way would be to deal with the number of files not dividing evenly into the batches, though; one possible approach is sketched below.
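For the record, one way to handle a file count that doesn't divide evenly would be to build the batches up front with split(), so the last batch simply ends up shorter. This is just a rough sketch of that idea and not benchmarked against the loop above; batch_size is a placeholder name, and it assumes, as above, that the objects loaded from the .RData files are named dataset_*:

library(data.table)

batch_size <- 13
batches <- split(filenames, ceiling(seq_along(filenames) / batch_size))  # last batch may be shorter

full_data <- as.data.table(data_init)
for (batch in batches) {
  for (f in batch) load(f)                                              # load every file in this batch
  loaded <- mget(ls(pattern = "dataset_"))                              # collect the loaded objects into a list
  full_data <- rbindlist(c(list(full_data), loaded), use.names = TRUE)  # combine batch with running total
  rm(list = c("loaded", ls(pattern = "dataset_")))                      # free the batch
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")     # drop duplicate tweets
}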
Upvotes: 0
Reputation: 851
A nicer way to do this could be to process the datasets in batches rather than iteratively. You could pass a list l of multiple data.tables to rbindlist and bind them together simultaneously. Then you'd call setkeyv to set the key for the data.table object. Calling unique.data.table on a data.table with a set key should be faster, but it may be equivalent to your use of by.
l <- list(full_data, data)            # full_data plus the newly loaded dataset
full_data <- rbindlist(l)             # bind them in a single call
setkeyv(full_data, "status_id")       # set the key on status_id
full_data <- unique(full_data)        # deduplicate
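Applied to your files, that could look roughly like the sketch below. It assumes each .RData file contains a single object whose name starts with dataset_, and load_one() is just a hypothetical helper name:

library(data.table)

filenames <- list.files(pattern = "dataset_*")

# load one .RData file into a fresh environment and return its contents as a data.table
load_one <- function(f) {
  e <- new.env()
  load(f, envir = e)
  as.data.table(get(ls(e, pattern = "dataset_"), envir = e))
}

l <- lapply(filenames, load_one)                  # list of all data.tables
full_data <- rbindlist(l, use.names = TRUE)       # bind them all at once
setkeyv(full_data, "status_id")                   # key the table on status_id
full_data <- unique(full_data, by = "status_id")  # deduplicate on status_id (by= is explicit, since recent data.table versions compare all columns by default)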
Upvotes: 2