Reputation: 32978
When reading records from a database cursor it is often unknown in advance how many rows are coming up. This makes it impossible to pre-allocate a list of the right size to store these objects.
What is an efficient way of storing all records in a list when the total size is unknown? The basic list type is slow because it will copy the entire list every time an element is appended:
x <- list()
for(i in 1:1e5){
  x[[i]] <- list("foo" = rnorm(3), bar = TRUE)
}
An environment is more efficient, but it is a map rather than an ordered set. So we need to convert the index into a string and then later sort the keys to retrieve the values, which seems suboptimal:
env <- new.env()
for(i in 1:1e5){
  env[[sprintf("%09d", i)]] <- list("foo" = rnorm(3), bar = TRUE)
}
x <- lapply(sort(ls(env)), get, env, inherits = FALSE)
A pairlist is supposed to be a linked list in R; however, R seems to coerce it into a regular list whenever an element is appended from within R.
Upvotes: 9
Views: 488
Reputation: 6682
An example of a repeatedly doubling list implementation is in my answer at https://stackoverflow.com/a/32870310/264177. It is not implemented as a linked list but as an expanding array, and it is quite a bit faster than the other alternatives for large data sets.
I actually built it for the same problem you are describing here: storing a large number of items retrieved from a database when you do not know the number of items beforehand.
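For illustration, here is a minimal sketch of that expanding-array idea (not the code from the linked answer): keep a list with spare capacity, double it whenever it fills up, and trim it once the cursor is exhausted. The fetch_row argument is a hypothetical stand-in for whatever call pulls the next record from your database cursor and returns NULL when there are no more rows.
collect_rows <- function(fetch_row) {
  buf <- vector("list", 16)            # start with a small capacity
  n <- 0
  repeat {
    row <- fetch_row()
    if (is.null(row)) break            # end of the cursor
    if (n == length(buf))
      length(buf) <- 2 * length(buf)   # double the capacity when full
    n <- n + 1
    buf[[n]] <- row
  }
  length(buf) <- n                     # shrink back to the rows actually read
  buf
}
Because the list only doubles when it is full, the total copying work stays proportional to the number of rows rather than growing quadratically.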
Upvotes: 0
Reputation: 10825
I understand that you probably don't need this answer, so this is just for the record: environments are not that slow, you just need to convert them to a list "properly".
This is your code, for reference. Yes, this is slow.
system.time({
  env <- new.env()
  for(i in 1:1e5){
    env[[sprintf("%09d", i)]] <- list("foo" = rnorm(3), bar = TRUE)
  }
})
#>    user  system elapsed
#>   1.583   0.034   1.632
system.time(
  x <- lapply(sort(ls(env)), get, env, inherits = FALSE)
)
#>    user  system elapsed
#>   1.595   0.014   1.629
A slightly faster way to put in the elements:
system.time({
  env <- new.env()
  for(i in 1:1e5){
    env[[as.character(i)]] <- list("foo" = rnorm(3), bar = TRUE)
  }
})
#>    user  system elapsed
#>   1.039   0.023   1.072
Not as fast as pre-allocated lists, but almost:
system.time({
  l <- list()
  length(l) <- 1e5
  for(i in 1:1e5){
    l[[i]] <- list("foo" = rnorm(3), bar = TRUE)
  }
})
#>    user  system elapsed
#>   0.870   0.013   0.889
A much faster way to convert the environment to a sorted list:
system.time({
  x <- as.list(env, sorted = FALSE)
  x <- x[order(as.numeric(names(x)))]
})
#>    user  system elapsed
#>   0.073   0.000   0.074
If this is fast enough for you, then it is a lot easier than C code and/or re-allocating storage.
Upvotes: 4
Reputation: 4117
A little while back I did some experiments with implementing stacks and queues with pairlists vs. lists in R, and I put them in this package: https://github.com/wch/qstack. I've added some benchmarks to the README.
The short version: using a pairlist isn't really faster than using a list and doubling it as it grows.
Upvotes: 3
Reputation: 32978
Below is a C implementation of c() for two pairlists.
#include <Rinternals.h>

SEXP C_join_pairlist(SEXP x, SEXP y) {
  if(!isPairList(x) || !isPairList(y))
    Rf_error("x and y must be pairlists");

  // special case: appending to an empty pairlist
  if(x == R_NilValue)
    return y;

  // find the tail of x
  SEXP tail = x;
  while(CDR(tail) != R_NilValue)
    tail = CDR(tail);

  // append y to the tail, in place
  SETCDR(tail, y);
  return x;
}
And a simple R wrapper:
join_pairlist <- function(x, values){
  .Call(C_join_pairlist, x, values)
}
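If you want to try this outside of a package (where useDynLib would normally register the symbol), one possible way to compile and load it is sketched below; the file name join_pairlist.c is an assumption, not something from the post.
# Assumes the C code above is saved as "join_pairlist.c" in the working directory.
system("R CMD SHLIB join_pairlist.c")                     # compile to a shared object
dyn.load(paste0("join_pairlist", .Platform$dynlib.ext))   # .so on Unix, .dll on Windows

# Outside a package there is no registered symbol object named C_join_pairlist,
# so look it up by name before calling the wrapper above:
C_join_pairlist <- getNativeSymbolInfo("C_join_pairlist")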
You use it like this:
> x <- pairlist("foo", "bar")
> y <- pairlist("baz", "bla", "boe")
> x <- join_pairlist(x, y)
> print(x)
[[1]]
[1] "foo"

[[2]]
[1] "bar"

[[3]]
[1] "baz"

[[4]]
[1] "bla"

[[5]]
[1] "boe"
This is efficient, but also dangerous because it changes the value of x without duplicating it. It is quite easy to accidentally introduce circular references this way.
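For example (a hypothetical session; the last call is deliberately left commented out):
x <- pairlist("foo", "bar")
y <- pairlist("baz")
x <- join_pairlist(x, y)   # the tail cell of x is now y's own cell
# join_pairlist(y, x)      # would set the CDR of that shared tail to x, whose
#                          # tail is y again: a cycle, so printing or traversing
#                          # the result would never terminate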
Upvotes: 3
Reputation: 21315
I think you need to dive into C / C++ to do this most efficiently -- R doesn't really provide any facilities at the R language level for modifying things in place (including pairlists, but excepting environments), so I would suggest:
1. Using e.g. C++ STL containers, which can grow efficiently, then coercing that back to whatever output you need once you have it, or
2. Just using plain old R pairlists, which you can interact with and extend fairly readily at the C level, and then coercing that at the end (if necessary).
You can, of course, use method (1) yourself with plain R by making something like a 'growable' vector (e.g. track the capacity, double it whenever necessary, and then shrink to fit at the end), but generally when you need this much low-level control it's worth dipping into C / C++.
Upvotes: 4
Reputation: 94222
This is slow:
> x <- list()
> for(i in 1:1e5){x[[i]]=list(foo=rnorm(3),bar=TRUE)}
I gave up waiting. But this is fast, almost instantaneous:
> x <- list()
> length(x)=1e5
> for(i in 1:1e5){x[[i]]=list(foo=rnorm(3),bar=TRUE)}
So I reckon the way to go is to extend the list length by 10000 every time, and prune it back once you get to the last element and know the final size.
> length(x)=2e5 # extend by another 1e5
> for(i in 1:1e5){x[[i+1e5]]=list(foo=rnorm(3),bar=TRUE)}
> length(x)=3e5 # and again... but this time only 100 more elts:
> for(i in 1:100){x[[i+2e5]]=list(foo=rnorm(3),bar=TRUE)}
> length(x) = 2e5 + 100
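Wrapped in a single loop, that extend-by-a-chunk-and-prune pattern might look like the following sketch, where the 123456 iteration count is just an arbitrary stand-in for a stream whose length is not known up front:
chunk <- 1e4
x <- vector("list", chunk)
n <- 0
for(i in 1:123456){
  if(n == length(x)) length(x) <- length(x) + chunk   # extend by another chunk when full
  n <- n + 1
  x[[n]] <- list(foo=rnorm(3), bar=TRUE)
}
length(x) <- n                                        # prune back to the elements actually used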
Another approach is to double the list size each time you need more elements.
Upvotes: 6