Reputation: 2940
I'd like to split a large directory of files into separate lists of approximately equal total file size. The idea is to split a huge directory full of CSVs of different sizes into file lists of similar total sizes for further processing.
Reproduce fake file data in R:
### reproduce fake file data (just the significant columns from file.info)
filedata <- data.frame(size = sample(20:4000000, 10000), isdir = FALSE,
                       stringsAsFactors = FALSE)
rownames(filedata) <- paste0("MYDIR/mycsv", seq_len(nrow(filedata)), ".csv")
The output (ideally) would be the filedata data.frame split into, for example, ten chunks (the number should be variable) of approximately equal total file size, NOT an equal number of files:
nchunks <- 10
# listofchunks <- function(...) should split filedata into nchunks chunks of
# approximately equal total size and return them as a list of data frames
### ideal output: chunk1, chunk2, ..., chunk10, each containing a unique list of
### files whose cumulative total file size is as close as possible to that of
### the other chunks.
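For a sense of scale, the per-chunk target is simply the grand total divided by the number of chunks; a quick check on the fake data above:
### target total size per chunk, in bytes
sum(as.numeric(filedata$size)) / nchunks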
Thanks!
Upvotes: 0
Views: 163
Reputation: 76402
The following function might be what the question asks for. Untested.
listofchunks <- function(files, nchunks, ...) {
  # cumulative file size; as.numeric avoids integer overflow on large totals
  S <- cumsum(as.numeric(files[['size']]))
  # breakpoints at multiples of the target chunk size
  f <- (sum(as.numeric(files[['size']])) %/% nchunks) * (0:(nchunks - 1))
  f <- findInterval(S, c(f, Inf))
  # split on the function argument, not the global filedata
  sp <- split(row.names(files), f)
  lapply(sp, function(x) {
    res <- lapply(x, read.csv, ...)
    names(res) <- x
    res
  })
}
listofchunks(filedata, nchunks = 10)
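Since the fake file names don't exist on disk, one hypothetical way to sanity-check the grouping itself, skipping the read.csv step, is to recompute the interval assignment and total the sizes per chunk:
S <- cumsum(as.numeric(filedata$size))
breaks <- (sum(as.numeric(filedata$size)) %/% 10) * (0:9)
grp <- findInterval(S, c(breaks, Inf))
### total size per chunk; the values should come out roughly equal
tapply(as.numeric(filedata$size), grp, sum)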
Upvotes: 0
Reputation: 34291
Another option is to use the bin packing function binPack() from the BBmisc package.
library(BBmisc)
library(dplyr)
library(purrr)   # for map_dbl() below
library(tibble)

listofchunks <- filedata %>%
  rownames_to_column() %>%
  mutate(sizeMB = size / 2^20) %>%  # avoid integer overflow by changing unit to MB
  mutate(bins = binPack(sizeMB, sum(sizeMB) / 10 * 1.01)) %>%
  group_split(bins)
Check size of bins:
map_dbl(listofchunks, ~ sum(.x$sizeMB))
[1] 1918.254 1918.254 1918.253 1918.253 1918.254 1918.254 1918.254 1918.254 1918.253 1728.331
Note that this is not an optimization function and the last bin will always be the smallest.
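If only the file names per chunk are needed downstream, they can be pulled back out of each piece; rowname is the default column name created by rownames_to_column():
### extract the file paths for each chunk
chunk_files <- lapply(listofchunks, function(ch) ch$rowname)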
Upvotes: 2
Reputation: 173793
You could split the data frame by quantiles (deciles, in the ten-chunk case) of the cumulative sum of file sizes. This seems to work in local testing.
listofchunks <- function(path, n_chunks)
{
  filedata <- data.frame(names = list.files(path, full.names = TRUE),
                         stringsAsFactors = FALSE)
  filedata$sizes <- file.size(filedata$names)   # file.size() is vectorised
  # integer-divide the running total by the chunk size; subtracting 0.01
  # keeps the final file in bin n_chunks - 1 instead of opening a new bin
  filedata$decile <- cumsum(filedata$sizes) %/% (sum(filedata$sizes) / (n_chunks - 0.01))
  split(filedata, filedata$decile)
}
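A usage sketch, assuming MYDIR exists and holds the CSV files; the per-chunk totals should come out roughly equal:
chunks <- listofchunks("MYDIR", n_chunks = 10)
### total size per chunk
sapply(chunks, function(ch) sum(ch$sizes))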
Upvotes: 0