Jeroen Ooms
Jeroen Ooms

Reputation: 32978

Efficient linked list (ordered set) in R

When reading records from a database cursor it is often unknown in advance how many rows are coming up. This makes it impossible to pre-allocate a list of the right size to store these objects.

What is an efficient way of storing all records in a list, when we the total size is unknown? The basic list type is slow because it will copy the entire list every time an element is appended:

x <- list()
for(i in 1:1e5){
  x[[i]] <- list("foo" = rnorm(3), bar = TRUE)
}

An environment is more efficient, but it is a map rather than an ordered set. So we need to convert the index into a string and then later sort the keys to retrieve the values, which seems suboptimal:

env <- new.env()
for(i in 1:1e5){
  env[[sprintf("%09d", i)]] <- list("foo" = rnorm(3), bar = TRUE)
}
x <- lapply(sort(ls(env)), get, env, inherits = FALSE)

A pairlist is supposed to be a linked list in R, however R seems to coerse it into a regular list whenever an element is appended from within R.

Upvotes: 9

Views: 488

Answers (6)

JanKanis
JanKanis

Reputation: 6682

An example of a repeatedly doubling list implementation is in my answer at https://stackoverflow.com/a/32870310/264177. It is not implemented as linked list but as an expanding array. It is quite a bit faster than other alternatives for large data sets.

I actually built that for the same problem you are describing here, storing a large number of items retrieved from a database where you do not know the number of items before hand.

Upvotes: 0

Gabor Csardi
Gabor Csardi

Reputation: 10825

I understand that you probably don't need this answer, so this just for the record: environments are not that slow, you just need to convert them to list "properly".

This is your code, for reference. Yes, this is slow.

system.time({
  env <- new.env()
  for(i in 1:1e5){
    env[[sprintf("%09d", i)]] <- list("foo" = rnorm(3), bar = TRUE)
  }
})
#>    user  system elapsed 
#>   1.583   0.034   1.632 

system.time(
  x <- lapply(sort(ls(env)), get, env, inherits = FALSE)
)
#>    user  system elapsed 
#>   1.595   0.014   1.629 

A slightly faster way to put in the elements:

system.time({
  env <- new.env()
  for(i in 1:1e5){
    env[[as.character(i)]] <- list("foo" = rnorm(3), bar = TRUE)
  }
})
#>    user  system elapsed 
#>   1.039   0.023   1.072 

Not as fast as pre-allocated lists, but almost:

system.time({
  l <- list()
  length(l) <- 1e5
  for(i in 1:1e5){
    l[[i]] <- list("foo" = rnorm(3), bar = TRUE)
  }
})

#>    user  system elapsed 
#>   0.870   0.013   0.889 

A much faster way to convert the environment to a sorted list:

system.time({
  x <- as.list(env, sorted = FALSE)
  x <- x[order(as.numeric(names(x)))]
})

#>    user  system elapsed 
#>   0.073   0.000   0.074 

If this is fast enough for you, then it is a lot easier than C code and/or re-allocating storage.

Upvotes: 4

wch
wch

Reputation: 4117

A little while back I did some experiments with implementing stacks and queues with pairlists vs. lists in R, and I put them in this package: https://github.com/wch/qstack. I've added some benchmarks to the README.

The short version: using a pairlist isn't really faster than using a list and doubling as it grows. Also:

  • It's dangerous to directly modify a pairlist that's used by any other R code, because R assumes that pairlists are copy-on-modification.
  • Lists are more memory efficient because items don't need pointers to the next item.
  • Random access in a list is much faster than in a pairlist.
  • Pairlists can be faster if you need to insert or remove items in the middle of them (using C).

Upvotes: 3

Jeroen Ooms
Jeroen Ooms

Reputation: 32978

Below a C implementation of c() for two pairlists.

#include <Rinternals.h>

SEXP C_join_pairlist(SEXP x, SEXP y) {
  if(!isPairList(x) || !isPairList(y))
    Rf_error("x and y must be pairlists");

  //special case
  if(x == R_NilValue)
    return y;

  //find the tail of x
  SEXP tail = x;
  while(CDR(tail) != R_NilValue)
    tail = CDR(tail);

  //append to tail
  SETCDR(tail, y);
  return x;
}

And a simple R wrapper:

join_pairlist <- function(x, values){
  .Call(C_join_pairlist, x, values)
}

You use it like this:

> x <- pairlist("foo", "bar")
> y <- pairlist("baz", "bla", "boe")
> x <- join_pairlist(x,y)
[1] TRUE

> print(x)
[[1]]
[1] "foo"

[[2]]
[1] "bar"

[[3]]
[1] "baz"

[[4]]
[1] "bla"

[[5]]
[1] "boe"

This is efficient, but also dangerous because it changes the value of x without duplicating it. It is quite easy to accidentally introduce circular references this way.

Upvotes: 3

Kevin Ushey
Kevin Ushey

Reputation: 21315

I think you need to dive into C / C++ to do this most efficiently -- R doesn't really provide any facilities at the R language level for modifying things in place (including pairlists, but excepting environments), so I would suggest:

  1. Using e.g. C++ STL containers, which can grow efficiently, then coercing that back to whatever output you need once you have it, or

  2. Just using plain old R pairlists, which you can interact with and extend fairly readily at the C level, and then coercing that at the end (if necessary).

You can, of course, use method (1) yourself with plain R by making something like a 'growable' vectors (e.g. track the capacity, double it whenever necessary, and then shrink to fit at the end) but generally when you need this much low level control it's worth dipping into C / C++.

Upvotes: 4

Spacedman
Spacedman

Reputation: 94222

This is slow:

 > x <- list()
 > for(i in 1:1e5){x[[i]]=list(foo=rnorm(3),bar=TRUE)}

I gave up waiting. But this is fast, almost instantaneous:

 > x <- list()
 > length(x)=1e5
 > for(i in 1:1e5){x[[i]]=list(foo=rnorm(3),bar=TRUE)}

So I reckon the way to go is to extend the list length by 10000 every time, and prune it back when you get to the last element and know the final output.

 > length(x)=2e5  # extend by another 1e5
 > for(i in 1:1e5){x[[i+1e5]]=list(foo=rnorm(3),bar=TRUE)}
 > length(x)=3e5  # and again... but this time only 100 more elts:
 > for(i in 1:100){x[[i+2e5]]=list(foo=rnorm(3),bar=TRUE)}
 > length(x) = 2e5 + 100

Another approach is to double the list size each time you need more elements.

Upvotes: 6

Related Questions