Reputation: 30311
I have a function in R that chokes if I apply it to a dataset with more than 1000 rows. Therefore, I want to split my dataset into a list of n chunks, each of not more than 1000 rows.
Here's the function I'm currently using to do the chunking:
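chunkData <- function(Data, chunkSize) {
  # Chunk index for each row: 0 for the first chunkSize rows, 1 for the next, ...
  Chunks <- floor(0:(nrow(Data) - 1) / chunkSize)
  # One logical subset per chunk; each pass scans the whole Chunks vector
  lapply(unique(Chunks), function(x) Data[Chunks == x, ])
}
chunkData(iris, 100)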
I would like to make this function more efficient, so that it runs faster on large datasets.
Upvotes: 3
Views: 3776
Reputation: 60944
You should also take a look at ddply from the plyr package, which is built around the split-apply-combine principle. This paper about the package explains how this works and what is available in plyr.
The general strategy I would take here is to add a new column to the dataset called chunkid, which cuts the data up into chunks of 1000 rows; look at the rep function to create this column. You can then do:
result = ddply(dat, .(chunkid), functionToPerform)
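As a rough sketch of the chunkid idea, assuming a made-up data frame dat and a placeholder functionToPerform (neither comes from the question), the column can be built with rep and then passed to ddply. The plyr call is guarded so the snippet still runs with base R alone if plyr is not installed:

```r
# Hypothetical example data: 2500 rows, so the chunks hold 1000, 1000 and 500 rows.
dat <- data.frame(x = seq_len(2500), y = rnorm(2500))
chunkSize <- 1000

# rep() repeats each chunk number chunkSize times; length.out trims the
# result to exactly nrow(dat), so the last chunk may be shorter.
nChunks <- ceiling(nrow(dat) / chunkSize)
dat$chunkid <- rep(seq_len(nChunks), each = chunkSize, length.out = nrow(dat))

# Hypothetical per-chunk function: returns a one-row summary of the chunk.
functionToPerform <- function(d) data.frame(n = nrow(d), mean_y = mean(d$y))

if (requireNamespace("plyr", quietly = TRUE)) {
  # The character form "chunkid" selects the same grouping variable as .(chunkid)
  result <- plyr::ddply(dat, "chunkid", functionToPerform)
} else {
  # Base R fallback with the same split-apply-combine shape
  result <- do.call(rbind, lapply(split(dat, dat$chunkid), functionToPerform))
}
```

Either branch yields one summary row per chunk; only the plyr branch adds the chunkid column to the result.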
I like plyr for its clear syntax and structure, and its support of parallel processing. As already said, please also take a look at data.table, which could be quite a bit faster in some situations.
An additional tip could be to use matrices instead of data.frames...
Upvotes: 2
Reputation: 174813
Replace the lapply() call with a call to split():
split(Data, Chunks)
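Put together, a version of the question's chunkData with that substitution might look like the sketch below. Using seq_len() and %/% for the chunk index is a choice made here, not part of the original function; it gives the same indices for non-empty data and avoids 0:(nrow(Data)-1) counting backwards when Data has zero rows.

```r
chunkData <- function(Data, chunkSize) {
  # Integer division puts rows 1..chunkSize in chunk 0, the next block in chunk 1, ...
  Chunks <- (seq_len(nrow(Data)) - 1) %/% chunkSize
  # split() partitions the rows in one pass instead of one subset per chunk
  split(Data, Chunks)
}

chunks <- chunkData(iris, 100)
length(chunks)        # 2
sapply(chunks, nrow)  # 100 and 50
```

The result is a list of data frames named by chunk index ("0", "1", ...), so it can be passed to lapply() just like the output of the original function.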
Upvotes: 3
Reputation: 55695
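You can do this easily using split from base R. For example, split(iris, 1:3) will split the iris dataset into a list of three data frames by row, with 1:3 recycled along the rows. To get chunks of a given size instead, pass a grouping vector that assigns each block of consecutive rows to the same group.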
Since the output is still a list of data frames, you can easily use lapply on the output to process the data, and combine the results as required.
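A minimal sketch of that workflow, assuming a chunk size of 40 and a placeholder summarise_chunk function (both are illustrative choices, not from the question):

```r
chunkSize <- 40

# Grouping vector: rows 1-40 -> 1, rows 41-80 -> 2, and so on
groups <- ceiling(seq_len(nrow(iris)) / chunkSize)
pieces <- split(iris, groups)

# Hypothetical per-chunk step: count the rows and average one column
summarise_chunk <- function(d) {
  data.frame(rows = nrow(d), mean_sepal_length = mean(d$Sepal.Length))
}

# Apply the step to every chunk, then bind the per-chunk results together
processed <- lapply(pieces, summarise_chunk)
combined <- do.call(rbind, processed)
combined$rows  # 40 40 40 30
```

do.call(rbind, ...) is only one way to combine; if the per-chunk results are not data frames, something like unlist() or Reduce() may fit better.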
Since speed is the primary reason for using this approach, I would recommend that you take a look at the data.table package, which works great with large data sets. If you give more information on what you are trying to achieve in your function, people on SO might be able to help.
Upvotes: 7