Sim

Reputation: 13528

High-performance big data manipulation in R

I am dealing with a collection of lists that contain deeply nested lists with no fixed structure, except that:

  1. The lists at level 1 have a single element called variations.
  2. All leaf data in the hierarchy is numeric.

For example:

list(
  list(variations = list(
    '12'   = list(x = c(a = 1))
    )),
  list(variations = list(
    '3'    = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
    ))
  )

I need to convert the list data structure into a melted (in the sense of the reshape package) data frame of the form

data.frame(
  variation = c( '12',   '3',   '3', 'abcd',    'abcd'),
  variable  = c('x.a', 'x.a', 'x.b',  'x.b', 'm.n.o.p'),
  value     = c(    1,     6,     4,      1,      1023)
  )

or another data structure I can perform fast grouping and filtering on.

There are many millions of nodes in the data structure. The collection can have thousands of entries, and each entry has tens of thousands of variations, each with 2 to 10 or more leaf nodes whose names are not known in advance.

I am looking for suggestions on how to build the dataframe from the collection in a fast way.

One approach would be to use unlist on the source data to flatten the lists but I am not sure about the following:

Regardless of whether unlist is the right way to go, I'm wondering:

Upvotes: 5

Views: 330

Answers (1)

Ari B. Friedman

Reputation: 72741

There's a rarely used function called rapply that operates recursively on lists. I have no idea how fast it is (it's based on lapply, so probably not terrible but not amazing), and it's tricky to use. But it's worth considering, if only for elegance.

Here's one basic example of its use, where test is the example list from the question:

> rapply( test, classes="numeric", how="unlist", f=function(var) data.frame(names(var),var) )
      variations.12.x.names.var.              variations.12.x.var       variations.3.x.names.var.1       variations.3.x.names.var.2              variations.3.x.var1 
                             "a"                              "1"                              "a"                              "b"                              "6" 
             variations.3.x.var2     variations.abcd.x.names.var.            variations.abcd.x.var variations.abcd.m.n.o.names.var.        variations.abcd.m.n.o.var 
                             "4"                              "b"                              "1"                              "p"                           "1023" 

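For comparison, here is a minimal base-R sketch of the unlist idea mentioned in the question, applied per variation so that the variation name never has to be parsed back out of a dotted string. The helper name melt_entry and the object name melted are made up for illustration, and the sketch assumes every entry has at least one variation and that leaf names contain no dots of their own. I have not timed it on data of the size described, so treat it as a starting point rather than a tuned solution.

```r
# Example collection copied from the question
test <- list(
  list(variations = list(
    '12'   = list(x = c(a = 1))
  )),
  list(variations = list(
    '3'    = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
  ))
)

# melt_entry is a hypothetical helper: it flattens one entry into long format.
melt_entry <- function(entry) {
  # unlist() collapses each variation into a named numeric vector whose
  # names are the dotted paths to the leaves, e.g. "m.n.o.p"
  flat <- lapply(entry$variations, unlist)
  data.frame(
    variation = rep(names(flat), lengths(flat)),  # one row per leaf
    variable  = unlist(lapply(flat, names), use.names = FALSE),
    value     = unlist(flat, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# One data frame per entry, stacked into a single melted data frame
melted <- do.call(rbind, lapply(test, melt_entry))
print(melted)
```

With thousands of entries, the do.call(rbind, ...) step is likely to become the bottleneck. If the data.table package is an option, data.table::rbindlist is a common replacement for that step, and the resulting table can then be grouped and filtered directly.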
Upvotes: 3
