Reputation: 23
I've run into the problem of combining a large number of datasets in R (858, to be precise) and wrote a couple of lines of code to do this as memory-efficiently as possible. The code below works, but I'm wondering if there is a faster way of doing it. Memory use is stable, yet the process slows down considerably with every dataset that gets added. Past 350 or so, it becomes excruciatingly slow. Not impossible to run, but I'm hoping for some way to optimize this.
Each dataset consists of around 17,000 rows over 88 columns, with a rather large number of duplicates between datasets. It's Twitter data, so the status_id variable serves as a good indicator of duplicate rows. I'm using data.table to make the process more efficient. After each dataset is added, duplicates are cleaned out with unique(), the just-loaded dataset is removed, and gc() is called. I suspect the unique() call slows down as the combined dataset grows, so chunking this process might be an option here. But maybe there's something else that I'm not seeing?
library(data.table)

filenames <- list.files(pattern = "dataset_*")
full_data <- data.table()

for (i in 1:length(filenames)) {
  load(filenames[i])
  print(paste(i, "/", length(filenames), ":", filenames[i]))          # progress: current dataset / total
  dataset <- as.data.table(get(ls(pattern = "dataset_")))             # fetch the object just loaded from the .RData file
  full_data <- rbind(full_data, dataset)                              # add new dataset
  rm(list = ls(pattern = "dataset_"))                                 # remove loaded dataset from the environment
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")   # remove duplicate tweets in data
  gc(verbose = FALSE)                                                 # call gc to clear out RAM
}
Any input on the issue is welcome; I'm trying to get into the habit of making my R code as efficient as possible :)
Upvotes: 1
Views: 928
Reputation: 23
So, I've done some tinkering and come to a first solution. Batch-combining the datasets with rbindlist indeed proved to be a great efficiency boost.
I believe this mostly has to do with the pre-allocation of size that rbindlist does automatically. I struggled a bit to find an elegant way to pre-allocate the result that did not involve hand-naming all 88 columns and wildly guessing the final size (due to the removal of duplicates along the way), so I didn't benchmark the rbindlist solution against an rbind solution with a pre-allocated data.table.
So here's a solution for batching the files and combining them via rbindlist. Combining all 858 datasets onto an initial dataset of ~236,000 tweets from an initial search clocked in at 909.68 seconds as measured with system.time(), and produced a dataset of ~2.5 million rows.
library(data.table)

filenames <- list.files(pattern = "dataset_*")
full_data <- as.data.table(data_init)

for (i in seq(1, length(filenames), 13)) {                            # sequence into batches (here: 13)
  files <- filenames[i:(i + 12)]                                      # files belonging to the current batch
  for (j in 1:length(files)) {
    load(files[j])
    print(paste((i + j - 1), "/", length(filenames), ":", files[j]))  # progress: current dataset / total
  }
  full_data <- rbindlist(mget(c(ls(pattern = "dataset"), ls(pattern = "full_data"))))  # add the new batch
  print("- batch combined -")
  rm(list = ls(pattern = "dataset"))                                  # remove the loaded datasets
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")   # remove duplicate tweets in data
}
I've divided them into batches of 13, since that divides evenly into the 858 datasets. Some testing showed that batches of ~8 datasets might be more efficient. I'm not sure what an elegant way would be to deal with the number of files not dividing evenly into the batches, though; one possible approach is sketched below.
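For the record, one way to handle a file count that doesn't divide evenly would be to build the batches up front with split(), so the last batch simply ends up shorter. This is just a rough sketch of that idea and not benchmarked against the loop above; batch_size is a placeholder name, and it assumes, as above, that the objects loaded from the .RData files are named dataset_*:

library(data.table)

batch_size <- 13
batches <- split(filenames, ceiling(seq_along(filenames) / batch_size))  # last batch may be shorter

full_data <- as.data.table(data_init)
for (batch in batches) {
  for (f in batch) load(f)                                              # load every file in this batch
  loaded <- mget(ls(pattern = "dataset_"))                              # collect the loaded objects into a list
  full_data <- rbindlist(c(list(full_data), loaded), use.names = TRUE)  # combine batch with running total
  rm(list = c("loaded", ls(pattern = "dataset_")))                      # free the batch
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")     # drop duplicate tweets
}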
Upvotes: 0
Reputation: 851
A nicer way to do this could be to process the datasets in batches rather than iteratively. You could pass a list l of multiple data.tables to rbindlist and bind them together simultaneously. Then you'd call setkeyv to set the key for the data.table object. Calling unique.data.table on a data.table with a set key should be faster, but it may be equivalent to your use of by.
l <- list(full_data, data)            # full_data plus the newly loaded dataset
full_data <- rbindlist(l)             # bind them in a single call
setkeyv(full_data, "status_id")       # set the key on status_id
full_data <- unique(full_data)        # deduplicate
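Applied to your files, that could look roughly like the sketch below. It assumes each .RData file contains a single object whose name starts with dataset_, and load_one() is just a hypothetical helper name:

library(data.table)

filenames <- list.files(pattern = "dataset_*")

# load one .RData file into a fresh environment and return its contents as a data.table
load_one <- function(f) {
  e <- new.env()
  load(f, envir = e)
  as.data.table(get(ls(e, pattern = "dataset_"), envir = e))
}

l <- lapply(filenames, load_one)                  # list of all data.tables
full_data <- rbindlist(l, use.names = TRUE)       # bind them all at once
setkeyv(full_data, "status_id")                   # key the table on status_id
full_data <- unique(full_data, by = "status_id")  # deduplicate on status_id (by= is explicit, since recent data.table versions compare all columns by default)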
Upvotes: 2