Reputation: 1
can somebody help with debugging a function. It is meant to do
dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)
myfunc(dat3, chunksize = 8)
## [1] 4 7 5 8 4 4 4 4 4 7 5 8 4 4 5 5 4
partition the data in chunks of a sizer and make sure that every level is present in every chunk. The function works for the toy example
myfunc <- function(x, chunksize = 8) {
numChunks <- ceiling(length(x) / chunksize)
uniqx <- unique(x)
lastChunkSize <- chunksize * (1 - numChunks) + length(x)
## check to see if it is mathematically possible
if (length(uniqx) > chunksize)
stop('more factors than can fit in one chunk')
if (any(table(x) < numChunks))
stop('not enough of at least one factor to cover all chunks')
if (lastChunkSize < length(uniqx))
stop('last chunk will not have all factors')
## actually arrange things in one feasible permutation
allIndices <- sapply(uniqx, function(z) which(z == x))
## fill one of each unique x into chunks
chunks <- lapply(1:numChunks, function(i) sapply(allIndices, `[`, i))
remainder <- unlist(sapply(allIndices, tail, n = -3))
remainderCut <- split(remainder, ceiling(seq_along(remainder)/4))
## combine them all together, wary of empty lists
finalIndices <- sapply(1:numChunks,
function(i) {
if (i <= length(remainderCut))
c(chunks[[i]], remainderCut[[i]])
else
chunks[[i]]
})
save(finalIndices,file="finalIndices")
x[unlist(finalIndices)]
}
the problem is that I want to get the rearranged indixes from the function (so what is called here final Indices). The problem is that for my real data set with more observations (https://www.dropbox.com/s/n3wc5qxaoavr4ta/j.RData?dl=0), the function does not work.
The data as factor https://www.dropbox.com/s/0ue2xzv5e6h858q/t.RData?dl=0
I change the chunkszie paramter according to the number of levels present to 9847 I in the first line of the function). The problem is that when I access finalIndices from the saved file, I get a matrix with dim 137 60. Which does not provide an index for all my observations (nearly 600k). Could somebody tell me what am i doing wrong? I know that 60 is the number of chunks (nrows/chunksize) but 137 appear not to fit.
Upvotes: 0
Views: 166
Reputation: 13314
The line remainderCut <- split(remainder, ceiling(seq_along(remainder)/4))
is hard-coded to the toy data set and just adds four elements to each chunk, which produces wrong results for other data sets.
Whereas this problem can be fixed by modifying your code, I have come up with a slightly different approach to this problem:
library(data.table)
generate.chunks <- function(dat3, chunksize = 8) {
# get number of unique values
freqs <- table(dat3)
# get chunk sizes
chunk.sizes <- rep(chunksize,length(dat3) %/% chunksize)
last.chunk.size <- length(dat3) %% chunksize
if (last.chunk.size > 0) chunk.sizes <- c(chunk.sizes,last.chunk.size)
# few checks
if (chunksize < length(freqs))
stop(sprintf('Chunk size is smaller than the number of factors: %i elements in a chunk, %i factors. Increase the chunk size',chunksize,length(freqs)))
if (chunk.sizes[length(chunk.sizes)] < length(freqs))
stop(sprintf('Last chunk size is smaller than the number of factors: %i elements in the chunk, %i factors. Use a different chunk size',chunksize,length(freqs)))
if (min(freqs) < length(chunk.sizes))
stop(sprintf('Not enough values in a factor to populate every chunk: %i < %i. Increase the chunk size',min(freqs),length(chunk.sizes)))
# make sure that each chunk has at least one factor
d.predefined <- data.frame(
chunk = rep(1:length(chunk.sizes),each=length(freqs)),
i = rep(1:length(freqs),length(chunk.sizes))
)
# randomly distribute the remaining values
d.sampled <- data.frame(
chunk = unlist(mapply(rep,1:length(chunk.sizes),chunk.sizes - length(freqs),SIMPLIFY=F)),
i = sample(unlist(mapply(rep,1:length(freqs),freqs - length(chunk.sizes))))
)
# put the predefined and sampled results together and split
d.result <- rbind(d.predefined,d.sampled)
# calculate indices
indices <- sapply(names(freqs),function(s) which(dat3==s))
dt <- as.data.table(d.result)
dt[,ind:=indices[[i]],by=i]
finalIndices <- split(dt$ind,dt$chunk)
save(finalIndices,file="finalIndices")
names(freqs)[d.result$i]
}
Upvotes: 1