Neal Barsch

Reputation: 2940

Split large directory of files into n chunks of approximately equal total file size [R]

I'd like to split a large directory of files into separate lists of chunks with approximately equal total file size. The idea is to split a huge directory full of CSV files of different sizes into file lists of similar total size for further processing.

Reproduce fake file data in R:

### Reproduce fake file data (just the relevant columns from file.info())
filedata <- data.frame(size = sample(20:4000000, 10000),
                       isdir = FALSE,
                       stringsAsFactors = FALSE)
rownames(filedata) <- paste0("MYDIR/mycsv", seq_len(nrow(filedata)), ".csv")
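As a rough reference point (illustrative only, not part of the task itself), the target total size per chunk for, say, ten chunks can be computed directly from this fake data:

## Illustrative only: approximate target total size per chunk for 10 chunks.
## as.numeric() avoids integer overflow when summing ~10000 integer sizes.
sum(as.numeric(filedata$size)) / 10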

The output (ideally) would be the filedata data frame split into, for example, ten chunks (a variable number) of approximately equal total file size, NOT an equal number of files:

nchunks <- 10
## Desired: a function that splits filedata into nchunks chunks of roughly
## equal total file size and returns them as a list of data frames, e.g.
## listofchunks <- function(filedata, nchunks) { ... }
## The ideal output would be chunk1, chunk2, ..., chunk10, each containing a
## unique list of files whose cumulative total file size is as close as
## possible to that of the other chunks.

Thanks!

Upvotes: 0

Views: 163

Answers (3)

Rui Barradas

Reputation: 76402

The following function might be what the question asks for. Untested.

listofchunks <- function(files, nchunks, ...){
  # Cumulative file sizes (as.numeric avoids integer overflow).
  S <- cumsum(as.numeric(files[['size']]))
  # Break points at multiples of the target per-chunk size.
  f <- (S[length(S)] %/% nchunks) * (0:(nchunks - 1))
  # Assign each file to a chunk according to where its cumulative size falls.
  f <- findInterval(S, c(f, Inf))
  sp <- split(row.names(files), f)
  # Read the files in each chunk and return a list of named lists.
  lapply(sp, function(x){
    res <- lapply(x, read.csv, ...)
    names(res) <- x
    res
  })
}

listofchunks(filedata, nchunks = 10)
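Since the fake files in the question do not exist on disk, read.csv() will fail there; a quick sanity check of just the grouping logic (my own sketch, not part of the original answer) is to apply the same findInterval() split to the sizes alone and sum each chunk:

## Sketch: check chunk balance using only the size column (no files read)
S <- cumsum(as.numeric(filedata$size))
breaks <- (S[length(S)] %/% 10) * (0:9)
grp <- findInterval(S, c(breaks, Inf))
tapply(as.numeric(filedata$size), grp, sum)  # total file size per chunk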

Upvotes: 0

lroha

Reputation: 34291

Another option is to use the bin packing function from the BBmisc package.

library(BBmisc)
library(dplyr)
library(tibble)
library(purrr) # for map_dbl() used below

listofchunks <- filedata %>% 
  rownames_to_column() %>%
  mutate(sizeMB = size / 2^20) %>% # Avoid integer overflow by changing unit to MB
  mutate(bins = binPack(sizeMB, sum(sizeMB) / 10 * 1.01)) %>% # Bin capacity: total / 10 plus 1% slack
  group_split(bins)

Check size of bins:

map_dbl(listofchunks, ~ sum(.x$sizeMB))

[1] 1918.254 1918.254 1918.253 1918.253 1918.254 1918.254 1918.254 1918.254 1918.253 1728.331

Note that this is not an optimization function and the last bin will always be the smallest.
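If you then want just the file paths in each chunk rather than the full tibbles, a minimal sketch (assuming the default "rowname" column produced by rownames_to_column()) is:

## Sketch: extract the file paths belonging to each chunk
map(listofchunks, "rowname")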

Upvotes: 2

Allan Cameron

Reputation: 173793

You could split the data frame by quantiles (deciles for ten chunks) of the cumulative sum of file sizes. This seems to work in local testing.

listofchunks <- function(path, n_chunks)
{
  filedata        <- data.frame(names = list.files(path, full.names = TRUE),
                                stringsAsFactors = FALSE)
  filedata$sizes  <- file.size(filedata$names)  # file.size() is vectorized
  # Chunk index from the cumulative size; the -0.01 keeps the final file
  # from spilling over into an extra chunk of its own.
  filedata$decile <- cumsum(filedata$sizes) %/% (sum(filedata$sizes) / (n_chunks - 0.01))
  split(filedata, filedata$decile)
}
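A minimal usage sketch (assuming a real directory of CSVs at the hypothetical path "MYDIR"), including a check of how balanced the chunk totals come out:

## Sketch: split the files under "MYDIR" into 10 chunks and inspect balance
chunks <- listofchunks("MYDIR", n_chunks = 10)
sapply(chunks, function(chunk) sum(chunk$sizes))  # total bytes per chunk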

Upvotes: 0
