Zelazny7

Reputation: 40618

Flatten a list with complex nested structure

I have a list with the following example structure:

> dput(test)
structure(list(id = 1, var1 = 2, var3 = 4, section1 = structure(list(
    var1 = 1, var2 = 2, var3 = 3), .Names = c("var1", "var2", 
"var3")), section2 = structure(list(row = structure(list(var1 = 1, 
    var2 = 2, var3 = 3), .Names = c("var1", "var2", "var3")), 
    row = structure(list(var1 = 4, var2 = 5, var3 = 6), .Names = c("var1", 
    "var2", "var3")), row = structure(list(var1 = 7, var2 = 8, 
        var3 = 9), .Names = c("var1", "var2", "var3"))), .Names = c("row", 
"row", "row"))), .Names = c("id", "var1", "var3", "section1", 
"section2"))


> str(test)
List of 5
 $ id      : num 1
 $ var1    : num 2
 $ var3    : num 4
 $ section1:List of 3
  ..$ var1: num 1
  ..$ var2: num 2
  ..$ var3: num 3
 $ section2:List of 3
  ..$ row:List of 3
  .. ..$ var1: num 1
  .. ..$ var2: num 2
  .. ..$ var3: num 3
  ..$ row:List of 3
  .. ..$ var1: num 4
  .. ..$ var2: num 5
  .. ..$ var3: num 6
  ..$ row:List of 3
  .. ..$ var1: num 7
  .. ..$ var2: num 8
  .. ..$ var3: num 9

Notice that the section2 list contains several elements that are all named row. These represent multiple records. What I have is a nested list where some elements are at the root level and others are multiple nested records for the same observation. I would like the following output in a data.frame format:

> desired
  id var1 var3 section1.var1 section1.var2 section1.var3 section2.var1 section2.var2 section2.var3
1  1    2    4             1             2             3             1             4             7
2 NA   NA   NA            NA            NA            NA             2             5             8
3 NA   NA   NA            NA            NA            NA             3             6             9

Root-level elements should populate the first row, while row elements should have their own rows. As an added complication, the number of variables in the row entries can vary.

Upvotes: 5

Views: 1917

Answers (4)

eddi

Reputation: 49448

This starts similarly to tiffany's answer, but diverges a bit afterwards.

library(data.table)

# flatten the first level
flat = unlist(test, recursive = FALSE)

# compute max length
N = max(sapply(flat, length))

# pad NA's and convert to data.table (at this point it will *look* like the right answer)
dt = as.data.table(lapply(flat, function(l) c(l, rep(NA, N - length(l)))))

# but in reality some of the columns are lists - check by running sapply(dt, class)
# so unlist them
dt = dt[, lapply(.SD, unlist)]
#   id var1 var3 section1.var1 section1.var2 section1.var3 section2.row section2.row section2.row
#1:  1    2    4             1             2             3            1            4            7
#2: NA   NA   NA            NA            NA            NA            2            5            8
#3: NA   NA   NA            NA            NA            NA            3            6            9
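For readers without data.table, the same pad-then-unlist idea can be sketched in base R. This is only an illustration under the assumption that `test` has the structure shown in the question; the names `padded`, `is_list_col`, `cols` and `out` are made up for this example, and `make.unique` is used here simply to give the three duplicate `section2.row` columns distinct names.

```r
# Same structure as the `test` list from the question
test <- list(
  id = 1, var1 = 2, var3 = 4,
  section1 = list(var1 = 1, var2 = 2, var3 = 3),
  section2 = list(
    row = list(var1 = 1, var2 = 2, var3 = 3),
    row = list(var1 = 4, var2 = 5, var3 = 6),
    row = list(var1 = 7, var2 = 8, var3 = 9)
  )
)

flat <- unlist(test, recursive = FALSE)   # flatten one level only
n <- max(lengths(flat))                   # longest element decides the row count

# Pad every element with NA up to length n
padded <- lapply(flat, function(l) c(l, rep(NA, n - length(l))))

# The former `row` elements are still lists after padding
is_list_col <- vapply(padded, is.list, logical(1))

# Unlist each column, then build the data frame
cols <- lapply(padded, unlist, use.names = FALSE)
out <- as.data.frame(cols)
names(out) <- make.unique(names(cols))    # duplicate names get numeric suffixes
print(out)
```

Checking `is_list_col` before the final `unlist` step is the base R counterpart of running `sapply(dt, class)` above.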

Upvotes: 0

tiffany

Reputation: 503

Here's a general approach. It doesn't assume that you'll have only three rows; it will work with however many rows you have. And if a value is missing in the nested structure (e.g. var1 doesn't exist for some sub-lists in section2), the code returns an NA for that cell.

E.g. if we use the following data:

test <- structure(list(id = 1, var1 = 2, var3 = 4, section1 = structure(list(var1 = 1, var2 = 2, var3 = 3), .Names = c("var1", "var2", "var3")), section2 = structure(list(row = structure(list(var1 = 1, var2 = 2), .Names = c("var1", "var2")), row = structure(list(var1 = 4, var2 = 5), .Names = c("var1", "var2")), row = structure(list( var2 = 8, var3 = 9), .Names = c("var2", "var3"))), .Names = c("row", "row", "row"))), .Names = c("id", "var1", "var3", "section1", "section2"))

The general approach is to use melt to create a dataframe that includes information about the nested structure, and then dcast to mold it into the format you desire.

library("reshape2")

flat <- unlist(test, recursive=FALSE)
row_idx <- grep("row", names(flat))  ## positions of the row elements
names(flat)[row_idx] <- gsub("row", "var", paste0(names(flat)[row_idx], seq_along(row_idx)))  ## keeps track of rows by adding an ID
ul <- melt(unlist(flat))
split <- strsplit(rownames(ul), split=".", fixed=TRUE) ## splits the names into component parts
max <- max(unlist(lapply(split, FUN=length)))
pad <- function(a) {
  c(a, rep(NA, max-length(a)))
}
levels <- matrix(unlist(lapply(split, FUN=pad)), ncol=max, byrow=TRUE)

## Get the nesting structure
nested <- data.frame(levels, ul)
nested$X3[is.na(nested$X3)] <- levels(as.factor(nested$X3))[[1]]
desired <- dcast(nested, X3~X1 + X2)
names(desired) <- gsub("_", "\\.", gsub("_NA", "", names(desired)))
desired <- desired[,names(flat)]

> desired
##   id var1 var3 section1.var1 section1.var2 section1.var3 section2.var1 section2.var2 section2.var3
## 1  1    2    4             1             2             3             1             4             7
## 2 NA   NA   NA            NA            NA            NA             2             5             8
## 3 NA   NA   NA            NA            NA            NA             3             6             9

Upvotes: 4

Jthorpe

Reputation: 10167

Your problem is not well defined when rows have complex structures. For example, if each row in test contained the list test itself, how should the rows be bound together? And what if rows in the same table have different structures? The following solution therefore depends on each row being a list of values.

That said, I'm guessing that in the general case, your list test will contain either values, lists of values, or lists of rows (where rows are lists of values). Also, if rows aren't always called "row" this solution still works.
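To make the three cases concrete, here is a minimal sketch that labels each top-level element of `test`. The helper name `classify` and the label strings are hypothetical and exist only for this illustration; they are not used by the solution itself.

```r
# Same structure as the `test` list from the question
test <- list(
  id = 1, var1 = 2, var3 = 4,
  section1 = list(var1 = 1, var2 = 2, var3 = 3),
  section2 = list(
    row = list(var1 = 1, var2 = 2, var3 = 3),
    row = list(var1 = 4, var2 = 5, var3 = 6),
    row = list(var1 = 7, var2 = 8, var3 = 9)
  )
)

# Hypothetical helper: decide which of the three cases an element falls into
classify <- function(x) {
  if (!is.list(x)) return("value")
  # a list whose members are themselves lists is treated as a list of rows
  if (any(vapply(x, is.list, logical(1)))) "list of rows" else "list of values"
}

kinds <- vapply(test, classify, character(1))
print(kinds)
```

Here `id`, `var1` and `var3` come out as values, `section1` as a list of values, and `section2` as a list of rows, regardless of what the row elements are called.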

temp <- lapply(test,
                function(x){
                    if(!is.list(x))
                        # x is a value
                        return(x)
                    # x is a list of rows or values
                    out <- do.call(cbind,x)
                    if(nrow(out)>1){
                        # x is a list of rows 
                        colnames(out)<-paste0(colnames(out),'.',rownames(out))
                        rownames(out)<-rep_len(NA,nrow(out))
                    }
                    return(out)
                })

# a function that extends a matrix to a fixed number of rows (N)
# by appending rows of NAs
rowExtend  <-  function(x,N){
                 if((!is.matrix(x)) ){
                     out<-do.call(rbind,c(list(x),as.list(rep_len(NA,N - 1))))
                     colnames(out) <- ""
                     out
                 }else if(nrow(x) < N)
                     do.call(rbind,c(list(x),as.list(rep_len(NA,N - nrow(x)))))
                 else
                     x
             }

# calculate the maximum number of rows
.nrows <- sapply(temp,nrow)
.nrows <- max(unlist(.nrows[!sapply(.nrows,is.null)]))

# extend the shorter rows
(temp2<-lapply(temp, rowExtend,.nrows))

# calculate new column names
newColNames <- mapply(function(x,y) {
                       if(nzchar(y)[1L])
                           paste0(x,'.',y)
                       else x
                        },
                       names(temp2),
                       lapply(temp2,colnames))


do.call(cbind,mapply(`colnames<-`,temp2,newColNames))

#> id var1 var3 section1.var1 section1.var2 section1.var3 section2.row.var1 section2.row.var2 section2.row.var3
#> 1  2    4    1             2             3             1                 4                 7                
#> NA NA   NA   NA            NA            NA            2                 5                 8                
#> NA NA   NA   NA            NA            NA            3                 6                 9                

Upvotes: 0

Marat Talipov

Reputation: 13304

The central idea of this solution is to flatten all sub-lists except the sub-lists named 'row'. This could be done by creating a unique ID for each list element (stored in z) and then requesting that all elements within a single 'row' should have the same ID (stored in z2; had to write a recursive function to traverse the nested list). Then, z2 could be used to group elements that belong to the same row. The resulting list can be converted into the matrix form using stri_list2matrix from the stringi package, and then converted into a data frame.
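As a rough illustration of the ID trick on a smaller list (the names `small`, `ids`, `grouped` and `group_id` are invented for this sketch, and the row handling is hard-coded for a single `section` rather than recursive):

```r
# A small list with one root-level value and two records named "row"
small <- list(
  id = 10,
  section = list(row = list(a = 1, b = 2), row = list(a = 3, b = 4))
)

flat <- unlist(small)                    # five leaves in a named numeric vector
ids <- relist(seq_along(flat), small)    # same shape as `small`, leaves numbered 1..5

# Give every leaf inside a row the ID of that row's first leaf
grouped <- ids
grouped$section <- lapply(ids$section, function(r) relist(rep(r[[1]], length(r)), r))

group_id <- unlist(grouped)              # leaves of the same row now share one ID
groups <- split(flat, group_id)          # one group per root value and per row
print(groups)
```

Each element of `groups` holds the values that belong together, which is the grouping that the full solution builds with the recursive function.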

utest <- unlist(test)
z <- relist(seq_along(utest),test)

recurse <- function(L) {
    if (!is.list(L)) return(L)
    b <- names(L)=='row'
    L.b <- lapply(L[b],function(k) relist(rep(k[[1]],length(k)),k))
    L.nb <- lapply(L[!b],recurse)
    c(L.b,L.nb)
}

z2 <- unlist(recurse(z))

library(stringi)
desired <- as.data.frame(stri_list2matrix(split(utest,z2)))
names(desired) <- names(z2)[unique(z2)]

desired
#     id var1 var3 section1.var1 section1.var2 section1.var3 section2.row.var1
# 1    1    2    4             1             2             3                 1
# 2 <NA> <NA> <NA>          <NA>          <NA>          <NA>                 2
# 3 <NA> <NA> <NA>          <NA>          <NA>          <NA>                 3
#   section2.row.var1 section2.row.var1
# 1                 4                 7
# 2                 5                 8
# 3                 6                 9

Upvotes: 1
