deschen

Reputation: 11016

bind_rows on a list fails if some sub-list elements are empty. Why?

Assuming the following list:

x <- list(list(q = 1880L, properties = list(), last_Import_Date = "2024-09-16"), 
          list(q = 1888L, properties = list(list(a = "x", b = "y")), last_Import_Date = "2024-09-16"),
          list(q = 1890L, properties = list(list(a = "x", b = "y")), last_Import_Date = "2024-09-16"))

I want to convert this list into a data frame, one row per list element. Usually, dplyr::bind_rows works well for this. However, one field of my list ("properties") is sometimes empty, and in that case bind_rows silently drops the affected row and keeps only the rows where the field is non-empty.

Can someone explain why that is?

And is there any (short) fix for it? I'm currently using rather ugly workarounds using list2DF, then transposing, then converting to data frame, then assigning names.

Wrong results (only keep non-empty properties):

x |>
  bind_rows()

# A tibble: 2 × 3
      q properties       last_Import_Date
  <int> <list>           <chr>           
1  1888 <named list [2]> 2024-09-16      
2  1890 <named list [2]> 2024-09-16 

UPDATE: Where I need additional help is with unnesting this "properties" column. unnest_longer shows the same "bug" and drops the NULL row, while unnest_wider requires an extra workaround to fix the names.

Upvotes: 8

Views: 517

Answers (3)

one

Reputation: 3902

bind_rows uses vctrs::data_frame under the hood. It turns out that vctrs::data_frame creates an empty data frame (0 rows) whenever any element has length 0 (e.g. list(), integer(0), character(0), etc.), because the length-1 elements are recycled to the common size of 0:

vctrs::data_frame(!!!list(q = 1880L, properties = list(), last_Import_Date = "2024-09-16"),.name_repair="unique")
[1] q                properties       last_Import_Date
<0 rows> (or 0-length row.names)

vctrs::data_frame(a=list("a"),b= integer(0))
[1] a b
<0 rows> (or 0-length row.names)

vctrs::data_frame(a=list(),b= 1)
[1] a b
<0 rows> (or 0-length row.names)
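
To find out which records would hit this, a quick base R check can tabulate the length of every field per record. This is only a sketch and not part of the answer above; the name field_lengths is made up for illustration:

```r
# Example list from the question
x <- list(list(q = 1880L, properties = list(), last_Import_Date = "2024-09-16"),
          list(q = 1888L, properties = list(list(a = "x", b = "y")), last_Import_Date = "2024-09-16"),
          list(q = 1890L, properties = list(list(a = "x", b = "y")), last_Import_Date = "2024-09-16"))

# One row per record, one column per field; each cell is the field's length
field_lengths <- t(sapply(x, lengths))
field_lengths

# Records that contain at least one zero-length field
empty_records <- which(rowSums(field_lengths == 0) > 0)
empty_records
```

Any record listed in empty_records becomes a 0-row data frame and therefore contributes no row to the bound result.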

One alternative is to use vctrs::vec_rbind:

vctrs::vec_rbind(!!!x)
     q properties last_Import_Date
1 1880       NULL       2024-09-16
2 1888       x, y       2024-09-16
3 1890       x, y       2024-09-16
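
If you would rather stay with bind_rows, another option is to pad the zero-length fields with a length-1 placeholder before binding. The sketch below is an assumption of mine rather than something from the dplyr documentation, and pad_empty is a hypothetical helper name. The padding itself is base R; the bind_rows call only runs if dplyr is installed:

```r
# Example list from the question
x <- list(list(q = 1880L, properties = list(), last_Import_Date = "2024-09-16"),
          list(q = 1888L, properties = list(list(a = "x", b = "y")), last_Import_Date = "2024-09-16"),
          list(q = 1890L, properties = list(list(a = "x", b = "y")), last_Import_Date = "2024-09-16"))

# Hypothetical helper: replace every zero-length field with list(NULL),
# so that each field of the record has length 1
pad_empty <- function(record) {
  empty <- lengths(record) == 0
  record[empty] <- list(list(NULL))
  record
}

padded <- lapply(x, pad_empty)

# Every field of every record now has length 1
all(sapply(padded, lengths) == 1)

# With dplyr available, binding the padded records should keep all three rows
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(dplyr::bind_rows(padded))
}
```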

Upvotes: 5

G. Grothendieck

Reputation: 270045

1) bind_rows. bind_rows will work if you pre- and post-process the input like this:

library(dplyr)
x |> lapply(unlist) |> bind_rows() |> type.convert(as.is = TRUE)

## # A tibble: 3 × 4
##       q last_Import_Date properties.a properties.b
##   <int> <chr>            <chr>        <chr>       
## 1  1880 2024-09-16       <NA>         <NA>        
## 2  1888 2024-09-16       x            y           
## 3  1890 2024-09-16       x            y           
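
To see why the pre- and post-processing steps are needed, it can help to look at what unlist does to a single record. The following base R sketch (flat is just an illustrative name) shows that each record collapses to a named character vector, that the empty properties field simply disappears from the first record, and that type.convert is what turns q back into an integer:

```r
# Example list from the question
x <- list(list(q = 1880L, properties = list(), last_Import_Date = "2024-09-16"),
          list(q = 1888L, properties = list(list(a = "x", b = "y")), last_Import_Date = "2024-09-16"),
          list(q = 1890L, properties = list(list(a = "x", b = "y")), last_Import_Date = "2024-09-16"))

flat <- lapply(x, unlist)

# First record: only q and last_Import_Date survive, both as character
flat[[1]]

# Second record: nested names are joined with a dot (properties.a, properties.b)
flat[[2]]

# unlist coerced q to character; type.convert restores the integer type
type.convert(flat[[1]][["q"]], as.is = TRUE)
```

Since every record is now a flat vector without zero-length elements, bind_rows keeps all rows and fills the missing properties columns with NA.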

2) transpose. Transposing x and then removing the extra layer of lists in properties allows us to use hoist to pull a and b out of properties.

library(purrr)
library(tidyr)
x |>
  transpose() |>
  list2DF() |>
  transform(properties = lapply(properties, unlist)) |>
  hoist(properties, "a", "b")

##      q    a    b last_Import_Date
## 1 1880 <NA> <NA>       2024-09-16
## 2 1888    x    y       2024-09-16
## 3 1890    x    y       2024-09-16

3) Base R. If a list column for properties is sufficient, then this double iteration uses only base R:

Map(\(z) sapply(x, "[[", z), names(x[[1]])) |> list2DF()

##      q properties last_Import_Date
## 1 1880       NULL       2024-09-16
## 2 1888       x, y       2024-09-16
## 3 1890       x, y       2024-09-16

4) rrapply. rrapply can create the data frame directly:

library(rrapply)
rrapply(x, how = "bind")

##      q last_Import_Date properties.1.a properties.1.b
## 1 1880       2024-09-16           <NA>           <NA>
## 2 1888       2024-09-16              x              y
## 3 1890       2024-09-16              x              y

5) Recursive. This base R solution is longer than the others but may be of interest anyway. We define getField, which, given a list that represents a row, finds and returns the value of the field named by the argument field, or NA if none is found. Map iterates over Names (q, last_Import_Date, a, b) and uses sapply to iterate over the rows for each name.

getField <- function(x, field) {
  ret <- NA
  if (is.list(x)) {
    if (field %in% names(x)) ret <- x[[field]]
    else for(el in x) if (!is.na(ret <- Recall(el, field))) break
  } 
  ret
}

# Names <- c("q", "a", "b", "last_Import_Date")
Names <- sub(".*\\.", "", unique(names(unlist(x))))
Map(\(fld) sapply(x, getField, field = fld), Names) |> list2DF()

##      q last_Import_Date    a    b
## 1 1880       2024-09-16 <NA> <NA>
## 2 1888       2024-09-16    x    y
## 3 1890       2024-09-16    x    y

Upvotes: 5

ThomasIsCoding

Reputation: 102529

Update

If you want to use unnest without dropping the rows where properties is empty, specify the option keep_empty = TRUE (building on @one's vec_rbind approach):

library(tidyr)  # provides unnest(), everything() and %>%

vctrs::vec_rbind(!!!x) %>%
    unnest(cols = everything(), keep_empty = TRUE)

which gives

# A tibble: 3 × 3
      q properties       last_Import_Date
  <int> <list>           <chr>
1  1880 <NULL>           2024-09-16
2  1888 <named list [2]> 2024-09-16
3  1890 <named list [2]> 2024-09-16 

and its base R equivalent might be

list2DF(
    lapply(
        as.data.frame(do.call(rbind, x)),
        \(v) unlist(replace(v, lengths(v) == 0, list(list(NULL))), FALSE)
    )
) 

which gives

     q properties last_Import_Date
1 1880       NULL       2024-09-16
2 1888       x, y       2024-09-16
3 1890       x, y       2024-09-16

and the structure looks like

'data.frame':   3 obs. of  3 variables:
 $ q               : int  1880 1888 1890
 $ properties      :List of 3
  ..$ : NULL
  ..$ :List of 2
  .. ..$ a: chr "x"
  .. ..$ b: chr "y"
  ..$ :List of 2
  .. ..$ a: chr "x"
  .. ..$ b: chr "y"
 $ last_Import_Date: chr  "2024-09-16" "2024-09-16" "2024-09-16"
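
The anonymous function in the base R version above is fairly dense, so here is a sketch of what it does to the properties column alone. The names v, padded and flat are only illustrative, and v is typed in by hand to mimic the list column that as.data.frame(do.call(rbind, x)) produces:

```r
# The properties column as a list with one element per row
v <- list(list(),
          list(list(a = "x", b = "y")),
          list(list(a = "x", b = "y")))

# Step 1: swap each zero-length element for list(NULL), so every row has length 1
padded <- replace(v, lengths(v) == 0, list(list(NULL)))
lengths(padded)

# Step 2: strip exactly one level of nesting; the NULL placeholder is kept
flat <- unlist(padded, recursive = FALSE)
str(flat)
```

For columns such as q, where every element is a length-1 atomic value, step 1 changes nothing and step 2 collapses the list into a plain vector, which is why q ends up as int in the structure shown above.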

Older (quick fixup)

Here is a base R quick fix

> as.data.frame(do.call(rbind, x))
     q properties last_Import_Date
1 1880       NULL       2024-09-16
2 1888       x, y       2024-09-16
3 1890       x, y       2024-09-16

and its structure looks like

> as.data.frame(do.call(rbind, x)) %>% str()
'data.frame':   3 obs. of  3 variables:
 $ q               :List of 3
  ..$ : int 1880
  ..$ : int 1888
  ..$ : int 1890
 $ properties      :List of 3
  ..$ : list()
  ..$ :List of 1
  .. ..$ :List of 2
  .. .. ..$ a: chr "x"
  .. .. ..$ b: chr "y"
  ..$ :List of 1
  .. ..$ :List of 2
  .. .. ..$ a: chr "x"
  .. .. ..$ b: chr "y"
 $ last_Import_Date:List of 3
  ..$ : chr "2024-09-16"
  ..$ : chr "2024-09-16"
  ..$ : chr "2024-09-16"

Upvotes: 4
