Reputation: 1190
I'm writing a gene-level analysis script in R, and I'll have to handle large amounts of data.
My initial idea was to create a deeply nested list structure, i.e. a set of lists within lists. Essentially, the structure is:
#12.8 mins
list[[1:8]][[1:1000]][[1:6]][[1:1000]]
This is huge and takes in excess of 12 mins purely to set up the data structure. By streamlining the process, I can get it down to about 1.6 mins when setting up only one value of the 1:8 list, so essentially...
#1.6 mins
list[[1:1]][[1:1000]][[1:6]][[1:1000]]
Normally I'd create the structure on the fly, as and when it's needed. However, I'm distributing the 1:1000 steps, which means I don't know which order they'll come back in.
Are there any other packages for handling the creation of this level of data? Could I use any more efficient data structures in my approach?
I apologise if this seems like the wrong approach entirely, but this is my first time handling big data in R.
Upvotes: 2
Views: 460
Reputation: 57686
Note that lists are vectors and, like any other vector, they can have a dim attribute.
l <- vector("list", 8 * 1000 * 6 * 1000)
dim(l) <- c(8, 1000, 6, 1000)
This is effectively instantaneous. You access individual elements with [[, e.g. l[[1, 2, 3, 4]].
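Because every slot already exists, results that come back from distributed workers in arbitrary order can be stored directly by index. The following is a minimal sketch with deliberately small dimensions; `arrival_order` and the stored values are made up for illustration and are not part of the original question.

```r
# Small dimensions for illustration; the question uses c(8, 1000, 6, 1000).
dims <- c(2, 5, 3, 4)
l <- vector("list", prod(dims))
dim(l) <- dims

# Hypothetical: results for the second index arrive in arbitrary order.
arrival_order <- c(3, 1, 5, 2, 4)
for (j in arrival_order) {
  # Placeholder payload; a real job would store its own result here.
  l[[1, j, 2, 1]] <- c(j, j^2)
}

stored <- l[[1, 3, 2, 1]]          # the value written for j = 3
empty  <- is.null(l[[2, 3, 2, 1]]) # slots never written remain NULL
```

Slots that have not been filled yet stay NULL, so is.null() can be used to check which results are still outstanding.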
Upvotes: 3
Reputation: 46866
A different strategy is to create a vector and a partitioning, e.g., to represent
list(1:4, 5:7)
as
l = list(data=1:7, partition=c(4, 7))
then one can do vectorized calculations, e.g.,
logl = list(data=log(l$data), partition = l$partition)
and other clever things. This avoids creating complicated lists and the iteration that they imply. This approach is formalized in the *List classes of the Bioconductor IRanges package.
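To make the partition idea concrete in base R before turning to IRanges, here is a small sketch that recovers a group index from the partition end points and uses it for per-group summaries. The helper names `widths`, `group`, `sums` and `pieces` are illustrative assumptions, not part of any package.

```r
# Flat data plus the end position of each group, as in the answer.
l <- list(data = 1:7, partition = c(4, 7))

# Vectorized transform on the flat data; the partition is reused unchanged.
logl <- list(data = log(l$data), partition = l$partition)

# Group sizes are the differences between consecutive end positions.
widths <- diff(c(0, l$partition))
# One group label per element of the flat vector.
group <- rep(seq_along(widths), widths)

# Per-group summary without looping over a nested list.
sums <- tapply(logl$data, group, sum)

# Recover the original list-of-vectors form only when it is really needed.
pieces <- split(l$data, group)
```

The flat vector stays the primary representation; split() is only called at the end if a list is required.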
> library(IRanges)
> l <- NumericList(1:4, 5:7)
> l
NumericList of length 2
[[1]] 1 2 3 4
[[2]] 5 6 7
> log(l)
NumericList of length 2
[[1]] 0 0.693147180559945 1.09861228866811 1.38629436111989
[[2]] 1.6094379124341 1.79175946922805 1.94591014905531
One idiom for working with this data is to unlist, transform, then relist; both unlist and relist are inexpensive, so the long-hand version of the above is relist(log(unlist(l)), l).
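The same unlist, transform, relist idiom can be tried on an ordinary base R list using utils::relist(), which is a rough sketch of the pattern rather than the IRanges implementation. Note the assumption here: the base R versions walk the list structure, so they may not be as cheap as the IRanges ones described above.

```r
# An ordinary base R list standing in for the NumericList above.
l <- list(1:4, 5:7)

# Flatten once, transform the flat vector in a single vectorized call ...
flat <- unlist(l)
# ... then restore the original shape, using l as the skeleton.
logl <- utils::relist(log(flat), skeleton = l)
```

Here logl has the same shape as l, with each element replaced by its logarithm.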
Depending on your data structure, the DataFrame class may be appropriate, e.g., the following can be manipulated like a data.frame (subset, etc.) but contains *List elements.
> DataFrame(Sample=c("A", "B"), VariableA=l, LogA=log(l))
DataFrame with 2 rows and 3 columns
       Sample     VariableA                                              LogA
  <character> <NumericList>                                     <NumericList>
1           A     1,2,3,...          0,0.693147180559945,1.09861228866811,...
2           B         5,6,7 1.6094379124341,1.79175946922805,1.94591014905531
For genomic data, where the coordinates of genes (or other features) on chromosomes are of fundamental importance, the GenomicRanges package and its GRanges / GRangesList classes are appropriate.
Upvotes: 2