Zach

Reputation: 30311

Divide a dataset into chunks

I have a function in R that chokes if I apply it to a dataset with more than 1000 rows. Therefore, I want to split my dataset into a list of n chunks, each of not more than 1000 rows.

Here's the function I'm currently using to do the chunking:

chunkData <- function(Data, chunkSize){
    Chunks <- floor(0:(nrow(Data) - 1) / chunkSize)
    lapply(unique(Chunks), function(x) Data[Chunks == x, ])
}
chunkData(iris,100)

I would like to make this function more efficient, so that it runs faster on large datasets.

Upvotes: 3

Views: 3776

Answers (3)

Paul Hiemstra

Reputation: 60944

You should also take a look at ddply from the plyr package; the package is built around the split-apply-combine principle. This paper about the package explains how this works and what is available in plyr.

The general strategy I would take here is to add a new column to the dataset called chunkid. This column cuts the data up into chunks of 1000 rows; look at the rep function to create it. You can then do:

result = ddply(dat, .(chunkid), functionToPerform)
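A minimal sketch of the chunkid idea, using iris with a chunk size of 50 so it is easy to check (summariseChunk here is a hypothetical stand-in for your real per-chunk function):

```r
library(plyr)

dat <- iris
chunkSize <- 50

# rep() assigns the same id to each consecutive block of chunkSize rows
dat$chunkid <- rep(seq_len(ceiling(nrow(dat) / chunkSize)),
                   each = chunkSize, length.out = nrow(dat))

# placeholder for the real work: report how many rows each chunk has
summariseChunk <- function(chunk) data.frame(rows = nrow(chunk))

result <- ddply(dat, .(chunkid), summariseChunk)
result
```

With 150 rows and a chunk size of 50 this yields three chunks of 50 rows each.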

I like plyr for its clear syntax and structure, and for its support of parallel processing. As others have already said, please also take a look at data.table, which can be quite a bit faster in some situations.

An additional tip could be to use matrices instead of data.frames...

Upvotes: 2

Gavin Simpson

Reputation: 174813

Replace the lapply() call in your function with a call to split():

split(Data, Chunks)
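Applied to the chunkData function from the question, that one-line change looks like this (a sketch, keeping the original Chunks construction unchanged):

```r
chunkData <- function(Data, chunkSize){
    # same grouping vector as in the question: 0 for rows 1..chunkSize, etc.
    Chunks <- floor(0:(nrow(Data) - 1) / chunkSize)
    # split() does the per-group subsetting in one vectorised call
    split(Data, Chunks)
}

chunks <- chunkData(iris, 100)
```

For iris (150 rows) with a chunk size of 100, this returns a list of two data frames of 100 and 50 rows.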

Upvotes: 3

Ramnath

Reputation: 55695

You can do this easily using split from base R. For example, split(iris, 1:3) will split the iris dataset into a list of three data frames by row (note that the grouping vector 1:3 is recycled, so the rows are interleaved across the groups). You can construct the grouping vector differently to get contiguous chunks of a given size.

Since the output is still a list of data frames, you can easily use lapply on the output to process the data, and combine them as required.
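A base-R sketch of that full split / apply / recombine workflow, using a chunk size of 60 on iris so the chunking is visible (identity stands in for whatever processing your real function does):

```r
dat <- iris        # stand-in for a large dataset
chunkSize <- 60

# ceiling() over the row index gives contiguous groups of at most chunkSize rows
ids <- ceiling(seq_len(nrow(dat)) / chunkSize)
chunks <- split(dat, ids)

# process each chunk with lapply, then recombine with rbind
processed  <- lapply(chunks, identity)
recombined <- do.call(rbind, processed)
```

Here iris's 150 rows split into chunks of 60, 60, and 30 rows, and do.call(rbind, ...) stitches the processed pieces back into a single data frame.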

Since speed is the primary issue for using this approach, I would recommend that you take a look at the data.table package, which works great with large datasets. If you give more information on what you are trying to achieve in your function, people on SO might be able to help.

Upvotes: 7
