Data Structures for Creating BIG Data in R

Question

I'm writing a gene-level analysis script in R and will have to handle large amounts of data.

My initial idea was to create a super-list structure: a set of lists nested within lists. Essentially, the structure is

#12.8 mins
list[[1:8]][[1:1000]][[1:6]][[1:1000]]

This is huge and takes in excess of 12 minutes purely to set up the data structure. Streamlining this process, I can get it down to about 1.6 minutes when setting up one value of the 1:8 list, so essentially...

#1.6 mins
list[[1:1]][[1:1000]][[1:6]][[1:1000]]

Normally I'd create the structure on the fly, as and when it's needed. However, I'm distributing the 1:1000 steps, which means I don't know which order they'll come back in.

Are there any other packages for handling the creation of this level of data? Could I use any more efficient data structures in my approach?

I apologise if this seems like the wrong approach entirely, but this is my first time handling big data in R.

Hong Ooi · Accepted Answer

Note that lists are vectors, and like any other vector, they can have a dim attribute.

l <- vector("list", 8 * 1000 * 6 * 1000)
dim(l) <- c(8, 1000, 6, 1000)

This is effectively instantaneous. You access individual elements with [[, e.g. l[[1, 2, 3, 4]].
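
To tie this back to the out-of-order concern in the question, here is a minimal sketch of filling such a list array as results arrive in arbitrary order. The dimensions are shrunk so it runs quickly, and the shuffled loop and the `result` payload are only stand-ins (assumptions) for whatever the distributed 1:1000 steps actually return.

```r
# Small dimensions so the sketch runs quickly; the question uses c(8, 1000, 6, 1000).
dims <- c(2, 5, 3, 4)

# Preallocate a list of NULLs and give it a dim attribute.
l <- vector("list", prod(dims))
dim(l) <- dims

# Simulate steps over the second dimension coming back in arbitrary order.
set.seed(1)
arrival_order <- sample(dims[2])

for (j in arrival_order) {
  # Hypothetical payload: any R object can be stored in a cell.
  result <- list(step = j, values = rnorm(3))
  l[[1, j, 1, 1]] <- result
}

# Each result sits at the index of its own step, regardless of arrival order.
steps_stored <- vapply(seq_len(dims[2]), function(j) l[[1, j, 1, 1]]$step, integer(1))
```

Because every cell is addressed by its indices, the order in which results come back does not matter, and cells that have not been filled yet simply remain NULL.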
