Reputation: 69

R - save as data.frame all elements of a list of lists efficiently

I have the following list, and I want to create a data.frame that holds every possible "path", when the numeric value of the arrays is> 0.

This is the list:

> ABBCCD2
$A1
$A1$B1
      D1    D2
C1 0.233 0.078
C2 0.039 0.039

$A1$B2
      D1    D2
C1 0.083 0.028
C2 0.056 0.056

$A1$B3
      D1    D2
C1 0.083 0.028
C2 0.056 0.056


$A2
$A2$B1
      D1    D2
C1 0.100 0.033
C2 0.017 0.017

$A2$B2
   D1 D2
C1  0  0
C2  0  0

$A2$B3
   D1 D2
C1  0  0
C2  0  0

And this is the result I want:

> res
   FUN INTC INTB INME  prob
1   A1   B1   C1   D1 0.233
2   A1   B1   C1   D2 0.078
3   A1   B1   C2   D1 0.039
4   A1   B1   C2   D2 0.039
5   A1   B2   C1   D1 0.083
6   A1   B2   C1   D2 0.028
7   A1   B2   C2   D1 0.056
8   A1   B2   C2   D2 0.056
9   A1   B3   C1   D1 0.083
10  A1   B3   C1   D2 0.028
11  A1   B3   C2   D1 0.056
12  A1   B3   C2   D2 0.056
13  A2   B1   C1   D1 0.100
14  A2   B1   C1   D2 0.033
15  A2   B1   C2   D1 0.017
16  A2   B1   C2   D2 0.017

I have solved it with for loops, but it is not efficient, since the real problem that I have there are 15 million possible paths, and it can take several days to solve it. This is the code I have made:

m <- 0

# creamos dataframe vacio
res <- data.frame(FUN=character(),INTC=character(),INTB=character(),INME=character(),prob=numeric())


for(i in 1:length(ABBCCD2)) { # A
  
 
  for (j in 1:length(ABBCCD2[[1]])) {  # B
    
    
    for(k in 1:nrow(ABBCCD2[[1]][[1]])) {  # C
      
      
      for(f in 1:ncol(ABBCCD2[[1]][[1]])) {  # D
       
        
        # solo guardamos las prob > 0
        if(ABBCCD2[[i]][[j]][k,f] > 0) {
        
        
            # contador de caminos con probabilidad no-cero
            m <- m + 1
            
            
            # creamos la fila del data frame correspondiente y vamos rellenando
            res[m,] <- data.frame(FUN=names(ABBCCD2[i]), INTC=names(ABBCCD2[[i]][j]), INTB=rownames(ABBCCD2[[i]][[j]])[k], 
                                 
                                 INME = colnames(ABBCCD2[[i]][[j]])[f] , prob = ABBCCD2[[i]][[j]][k,f] )
        }else{
          
        }

      }
    }
  }
}

Any ideas to solve it more efficiently?

Thank you all

Upvotes: 3

Answers (3)

Uwe

Reputation: 42544

If I understand correctly, the challenges are

to convert the matrices to data.frames,
to skip zero matrices,
to bind all pieces from the nested lists into one large dataset,
to reshape into long format,
to retain the names of list elements and matrix dimensions.

(Execution not necessarily in that order)

An additional challenge was that the question shows the nested list in printed form but not in a reproducible form, e.g., dput(). See the Data section on turning the printout into a list structure.

For the sake of completeness, here are two other approaches.

nested lapply() and rbindlist()
rrapply::rrapply() and reshape2::melt()

Nested `lapply()` and `rbindlist()`

library(data.table)
library(magrittr)
res <- lapply(
  ABBCCD2, 
  function(x) lapply(x, as.data.table, keep.rownames = "INTB") %>% rbindlist(idcol = "INTC")
) %>% 
  rbindlist(idcol = "FUN") %>% 
  melt(measure.vars = patterns("^D"), variable.name = "INME", value.name = "prob") %>% 
  .[prob != 0] %>%
  setorderv(names(.))
res

    FUN INTC INTB INME  prob
 1:  A1   B1   C1   D1 0.233
 2:  A1   B1   C1   D2 0.078
 3:  A1   B1   C2   D1 0.039
 4:  A1   B1   C2   D2 0.039
 5:  A1   B2   C1   D1 0.083
 6:  A1   B2   C1   D2 0.028
 7:  A1   B2   C2   D1 0.056
 8:  A1   B2   C2   D2 0.056
 9:  A1   B3   C1   D1 0.083
10:  A1   B3   C1   D2 0.028
11:  A1   B3   C2   D1 0.056
12:  A1   B3   C2   D2 0.056
13:  A2   B1   C1   D1 0.100
14:  A2   B1   C1   D2 0.033
15:  A2   B1   C2   D1 0.017
16:  A2   B1   C2   D2 0.017

magrittr piping is used to improve readability.

This approach converts the single 2 x 2 matrices into data.tables with 3 columns and 2 rows each. These are then combined by rbindlist() in two steps to form one large data.table. Finally, the two value columns are reshaped to long format and zero prob values are removed.

setorderv() is only used to allow for a direct comparison with OP's expected result.

Caveat: Zero prob values are removed after all data have been turned into long format. This may lead to unexpected results in case one of the matrices contains a zero element just by chance.

rrapply() and matrix `melt()`

Here is a different approach which first reshapes the matrices to long form data.tables (after excluding matrices with all zero elements) which are then combined into one large dataset by two rbindlist() steps:

library(data.table)
library(magrittr)
library(rrapply)
res2 <- rrapply(ABBCCD2, 
                condition = function(x) sum(abs(x)) > 0, 
                f = function(x) reshape2::melt(x, value.name = "prob"), 
                classes = "matrix", how = "prune") %>% 
  lapply(rbindlist, idcol = "INTC") %>% 
  rbindlist(idcol = "FUN") %>% 
  setnames(c("Var1", "Var2"), c("INTB", "INME"))%>%
  setorderv(names(.))
res2

The result is the same as above.

Data

Here is a way to turn the printout into a nested list structure:

txt <- "$A1
$A1$B1
D1    D2
C1 0.233 0.078
C2 0.039 0.039

$A1$B2
D1    D2
C1 0.083 0.028
C2 0.056 0.056

$A1$B3
D1    D2
C1 0.083 0.028
C2 0.056 0.056


$A2
$A2$B1
D1    D2
C1 0.100 0.033
C2 0.017 0.017

$A2$B2
D1 D2
C1  0  0
C2  0  0

$A2$B3
D1 D2
C1  0  0
C2  0  0"

txt contains the printout as copied and pasted from the question

library(data.table)
library(magrittr)
library(rrapply)

ABBCCD2 <- fread(text = txt, sep = NULL, header = FALSE, blank.lines.skip = TRUE) %>% 
  .[, tstrsplit(V1, "\\$")] %>% 
  .[, c("V2", "V3") := zoo::na.locf(.SD, na.rm = FALSE), .SDcols = c("V2", "V3")] %>% 
  .[V1 != ""] %>% 
  split(by = c("V2", "V3"), flatten = FALSE, keep.by = FALSE) %>% 
  rrapply(
    f = . %>% 
      .[, paste0(V1, collapse = "\n") %>% 
          {paste("rn", .)} %>% 
          fread() %>% 
          as.matrix(rownames = "rn")]
    , classes = "data.frame", how = "replace")


ABBCCD2

$A1
$A1$B1
      D1    D2
C1 0.233 0.078
C2 0.039 0.039

$A1$B2
      D1    D2
C1 0.083 0.028
C2 0.056 0.056

$A1$B3
      D1    D2
C1 0.083 0.028
C2 0.056 0.056


$A2
$A2$B1
      D1    D2
C1 0.100 0.033
C2 0.017 0.017

$A2$B2
   D1 D2
C1  0  0
C2  0  0

$A2$B3
   D1 D2
C1  0  0
C2  0  0

Upvotes: 3

ThomasIsCoding

Reputation: 101335

Here is a base R option using stack

rev(
  transform(
    stack(df <- as.data.frame(
      rapply(ABBCCD2,
        t,
        how = "replace"
      )
    )),
    ind = paste0(ind, ".", row.names(df))
  )
)

which gives

           ind     values
1  A1.B1.C1.D1 -1.0978872
2  A1.B1.C1.D2  0.1646925
3  A1.B1.C2.D1 -0.8782714
4  A1.B1.C2.D2  1.2239280
5  A1.B2.C1.D1  0.2088934
6  A1.B2.C1.D2 -1.6247005
7  A1.B2.C2.D1  0.2191693
8  A1.B2.C2.D2 -0.4496129
9  A2.B1.C1.D1  0.3426282
10 A2.B1.C1.D2  1.8424623
11 A2.B1.C2.D1 -1.0963979
12 A2.B1.C2.D2 -0.2248845
13 A2.B2.C1.D1 -0.9655256
14 A2.B2.C1.D2 -1.2647063
15 A2.B2.C2.D1  0.6998366
16 A2.B2.C2.D2  0.4514344
17 A2.B3.C1.D1  0.0000000
18 A2.B3.C1.D2  0.0000000
19 A2.B3.C2.D1  0.0000000
20 A2.B3.C2.D2  0.0000000

Upvotes: 1

akrun

Reputation: 887108

Here is an option

library(rrapply)
library(purrr)
library(dplyr)
library(tidyr)
map_depth(ABBCCD2, 2, ~ as.data.frame.table(.x)) %>% 
  map_dfr(~ bind_rows(.x, .id = 'INTC'), .id = 'FUN') %>% 
  rename_at(3:5, ~c("INTB", "INME", "prob"))  %>% 
  filter(prob != 0)

-output

#    FUN INTC INTB INME       prob
#1   A1   B1   C1   D1 -1.0978872
#2   A1   B1   C2   D1 -0.8782714
#3   A1   B1   C1   D2  0.1646925
#4   A1   B1   C2   D2  1.2239280
#5   A1   B2   C1   D1  0.2088934
#6   A1   B2   C2   D1  0.2191693
#7   A1   B2   C1   D2 -1.6247005
#8   A1   B2   C2   D2 -0.4496129
#9   A2   B1   C1   D1  0.3426282
#10  A2   B1   C2   D1 -1.0963979
#11  A2   B1   C1   D2  1.8424623
#12  A2   B1   C2   D2 -0.2248845
#13  A2   B2   C1   D1 -0.9655256
#14  A2   B2   C2   D1  0.6998366
#15  A2   B2   C1   D2 -1.2647063
#16  A2   B2   C2   D2  0.4514344

data

ABBCCD2 <- list(A1 = list(B1 = structure(c(-1.0978871935389, -0.878271447742256, 
0.164692499183084, 1.22392804082201), .Dim = c(2L, 2L), .Dimnames = list(
    c("C1", "C2"), c("D1", "D2"))), B2 = structure(c(0.208893448902667, 
0.21916929248291, -1.62470051990683, -0.449612869059051), .Dim = c(2L, 
2L), .Dimnames = list(c("C1", "C2"), c("D1", "D2")))), A2 = list(
    B1 = structure(c(0.34262819072166, -1.09639792471103, 1.8424623311698, 
    -0.224884516346163), .Dim = c(2L, 2L), .Dimnames = list(c("C1", 
    "C2"), c("D1", "D2"))), B2 = structure(c(-0.965525564286861, 
    0.699836580462635, -1.26470634026811, 0.451434438203962), .Dim = c(2L, 
    2L), .Dimnames = list(c("C1", "C2"), c("D1", "D2"))), B3 = structure(c(0, 
    0, 0, 0), .Dim = c(2L, 2L), .Dimnames = list(c("C1", "C2"
    ), c("D1", "D2")))))

Upvotes: 3

R - save as data.frame all elements of a list of lists efficiently

Answers (3)

Nested lapply() and rbindlist()

rrapply() and matrix melt()

Data

data

Related Questions

Nested `lapply()` and `rbindlist()`

rrapply() and matrix `melt()`