A_Skelton73

Reputation: 1190

Data Structures for Creating BIG Data in R

I'm writing a gene level analysis script in R and I'll have to handle large amounts of data.

My initial idea was to create a "super list" structure: a set of lists nested within lists. Essentially, the structure is:

#12.8 mins
list[[1:8]][[1:1000]][[1:6]][[1:1000]] 

This is huge and takes more than 12 minutes purely to set up the data structure. By streamlining the process, I can get it down to about 1.6 minutes when setting up only one value of the 1:8 list, so essentially...

#1.6 mins
list[[1:1]][[1:1000]][[1:6]][[1:1000]] 

Normally, I'd create the structure on the fly, as and when it's needed. However, I'm distributing the 1:1000 steps, which means I don't know which order they'll come back in.

Are there any other packages for handling the creation of this level of data? Could I use any more efficient data structures in my approach?

I apologise if this seems like the wrong approach entirely, but this is my first time handling big data in R.

Upvotes: 2

Views: 460

Answers (2)

Hong Ooi

Reputation: 57686

Note that lists are vectors, and like any other vector, they can have a dim attribute.

l <- vector("list", 8 * 1000 * 6 * 1000)
dim(l) <- c(8, 1000, 6, 1000)

This is effectively instantaneous. You access individual elements with [[, e.g., l[[1, 2, 3, 4]].
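Since the question mentions that distributed steps can come back in any order, here is a minimal sketch of filling such a list array out of order. The dimensions are shrunk so it runs instantly, and the `results` list with its `step` and `value` fields is a hypothetical stand-in for whatever the workers actually return.

```r
# Small stand-in dimensions; the question uses c(8, 1000, 6, 1000).
dims <- c(2, 5, 3, 4)
l <- vector("list", prod(dims))
dim(l) <- dims

# Hypothetical results arriving out of order; each one carries its own step index.
results <- list(
  list(step = 4, value = c(0.1, 0.2)),
  list(step = 1, value = c(0.3, 0.4)),
  list(step = 3, value = c(0.5, 0.6))
)

for (r in results) {
  # Store each result at its own coordinates, whatever the arrival order.
  l[[1, r$step, 2, 1]] <- r$value
}

# Which steps along the second dimension have been filled so far?
filled <- !vapply(l[1, , 2, 1], is.null, logical(1))
```

Because every slot already exists, the arrival order does not matter, and unfilled slots simply stay NULL until their result comes in.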

Upvotes: 3

Martin Morgan

Reputation: 46866

A different strategy is to create a vector and a partitioning, e.g., to represent

list(1:4, 5:7)

as

l = list(data=1:7, partition=c(4, 7))

then one can do vectorized calculations, e.g.,

logl = list(data=log(l$data), partition = l$partition)

and other clever things. This avoids creating complicated lists and the iteration that they imply. This approach is formalized in the *List classes of the Bioconductor IRanges package.
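Before turning to IRanges, the data-plus-partition idea can be sketched in base R. This is only an illustration of the representation described above, not how IRanges implements it; the names `widths`, `group`, `group_sums` and `as_list` are made up for the example.

```r
# Flat data plus the end position of each group, as in the text above.
l <- list(data = 1:7, partition = c(4, 7))

# Group sizes follow from the differences between successive end positions.
widths <- diff(c(0, l$partition))

# One group label per element of the flat vector.
group <- rep(seq_along(widths), widths)

# Per-group summaries without looping over a nested list.
group_sums <- as.vector(rowsum(l$data, group))

# The nested list can be recovered whenever it is actually needed.
as_list <- unname(split(l$data, group))
```

The flat vector stays the primary object here; the nested form is only built on demand.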

> library(IRanges)
> l <- NumericList(1:4, 5:7)
> l
NumericList of length 2
[[1]] 1 2 3 4
[[2]] 5 6 7
> log(l)
NumericList of length 2
[[1]] 0 0.693147180559945 1.09861228866811 1.38629436111989
[[2]] 1.6094379124341 1.79175946922805 1.94591014905531

One idiom for working with this data is to unlist, transform, then relist. Both unlist and relist are inexpensive, so the long-hand version of the above is relist(log(unlist(l)), l).
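The same unlist, transform, relist idiom can be tried on an ordinary list with `relist()` from base R's utils package, which makes it easy to experiment without installing IRanges. This is a sketch for illustration only; the remark about unlist and relist being inexpensive refers to the IRanges classes, and I would not assume the base R version is equally cheap on very large lists.

```r
# An ordinary base R list standing in for the NumericList above.
x <- list(1:4, 5:7)

# Flatten once, apply the vectorized transformation to the whole vector,
# then restore the original grouping using x as the skeleton.
flat <- unlist(x)
logx <- utils::relist(log(flat), x)
```

Here `logx` has the same shape as `x`, with each element replaced by its logarithm.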

Depending on your data structure, the DataFrame class may be appropriate, e.g., the following can be manipulated like a data.frame (subset, etc) but contains *List elements.

> DataFrame(Sample=c("A", "B"), VariableA=l, LogA=log(l))
DataFrame with 2 rows and 3 columns
       Sample     VariableA                                              LogA
  <character> <NumericList>                                     <NumericList>
1           A     1,2,3,...          0,0.693147180559945,1.09861228866811,...
2           B         5,6,7 1.6094379124341,1.79175946922805,1.94591014905531

For genomic data where the coordinates of genes (or other features) on chromosomes are of fundamental importance, the GenomicRanges package and its GRanges / GRangesList classes are appropriate.

Upvotes: 2
