Mark Druffel
Mark Druffel

Reputation: 886

Gather() list columns to rows in R

I want to gather() list columns to create new rows in my data frame. I'm using the Game of Thrones data set in the repurrrsive package. Below is my code to set up the problem:

library(tidyverse)
got_chars <- repurrrsive::got_chars
    df <- got_chars %>% 
    {
      tibble::tibble(
        Name = map_chr(., 'name'),
        Gender = map_chr(.,'gender'),
        Culture = map_chr(.,'culture'),
        Born = map_chr(.,'born'),
        Alive = map_chr(.,'alive'),
        Titles = map(.,'titles'),
        Aliases = map(., "aliases"),
        Allegiances = map(., "allegiances"),
        Books = map(.,'books'),
        POV_Books = map(.,'povBooks'),
        TV_Series = map(.,'tvSeries'),
        Actor = map(.,'playedBy')
      )
    }

What I would like to be able to do, but can't figure out is to gather() the list columns (e.g. Books, POV_Books, etc.) in order to create a new row for each record. For example:

Name | Book
Theon Greyjoy | A Game of Thrones
Theon Greyjoy | A Storm of Swords
Theon Greyjoy | A Feast for Crows

The closest I've been able to get is:

df_books <- df %>%
  separate_rows(Books,sep="\"")

This will work, but leaves a trail of garbage behind from the c() characters within the vectors. I can filter those out, but I feel like there's a much better way and I might just not be trying the correct functions. Any suggestions would be much appreciated, thanks!

Upvotes: 3

Views: 3345

Answers (2)

moodymudskipper
moodymudskipper

Reputation: 47350

You can use unnest, but first you have to format the column in a way tidyr understands.

This means:

  • No NULL elements in the columns you want to unnest
  • having your items as one column data.frames and not vectors

.

library(tidyverse)
df %>%
  select(Name,Books) %>% # skip this line to keep all columns
  slice(which(lengths(Books)>0)) %>%
  mutate(Books = map(Books,~tibble(Book=.x))) %>%
  unnest(Books)

# # A tibble: 77 x 2
#                Name                      Book
#               <chr>                     <chr>
# 1     Theon Greyjoy         A Game of Thrones
# 2     Theon Greyjoy         A Storm of Swords
# 3     Theon Greyjoy         A Feast for Crows
# 4  Tyrion Lannister         A Feast for Crows
# 5  Tyrion Lannister The World of Ice and Fire
# 6 Victarion Greyjoy         A Game of Thrones
# 7 Victarion Greyjoy          A Clash of Kings
# 8 Victarion Greyjoy         A Storm of Swords
# 9              Will          A Clash of Kings
# 10        Areo Hotah         A Game of Thrones
# # ... with 67 more rows

Your tried solution works fine though if we filter the output (same output as my solution):

df %>%
  select(Name, Books) %>%
  separate_rows(Books,sep="\"") %>%
  filter(!Books %in% c("c(",", ",")") & lengths(Books)>0)

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193677

Your tibble currently looks like this:

df
# # A tibble: 30 x 12
#    Name               Gender Culture Born  Alive Titles Aliases Allegiances Books POV_Books TV_Series Actor
#    <chr>              <chr>  <chr>   <chr> <chr> <list> <list>  <list>      <lis> <list>    <list>    <lis>
#  1 Theon Greyjoy      Male   Ironbo… In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [2]> <chr [6]> <chr…
#  2 Tyrion Lannister   Male   ""      In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [4]> <chr [6]> <chr…
#  3 Victarion Greyjoy  Male   Ironbo… In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [2]> <chr [1]> <chr…
#  4 Will               Male   ""      ""    FALSE <chr … <chr [… <NULL>      <chr… <chr [1]> <chr [1]> <chr…
#  5 Areo Hotah         Male   Norvos… In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [2]> <chr [2]> <chr…
#  6 Chett              Male   ""      At H… FALSE <chr … <chr [… <NULL>      <chr… <chr [1]> <chr [1]> <chr…
#  7 Cressen            Male   ""      In 2… FALSE <chr … <chr [… <NULL>      <chr… <chr [1]> <chr [1]> <chr…
#  8 Arianne Martell    Female Dornish In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [1]> <chr [1]> <chr…
#  9 Daenerys Targaryen Female Valyri… In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [4]> <chr [6]> <chr…
# 10 Davos Seaworth     Male   Wester… In 2… TRUE  <chr … <chr [… <chr [2]>   <chr… <chr [3]> <chr [5]> <chr…
# # ... with 20 more rows

unnest() would be an obvious choice, but doesn't work if all of the lists are the same in terms of how many values they would expand to.

library(tidyverse)
unnest(df)
# Error: All nested columns must have the same number of elements.

One approach would be to use the following functions. flatten() makes the data "wide", and flattenLong() takes the "wide" data and makes it "long". The assumption that has been made about missing data is that if a vector in a list item is shorter than the matching vector in another list item, the missing data are last.

flatten <- function(indt, cols, drop = FALSE) {
  require(data.table)
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
  nams <- paste(rep(cols, x), sequence(x), sep = "_")
  indt[, (nams) := unlist(lapply(.SD, data.table::transpose), recursive = FALSE), .SDcols = (cols)]
  if (isTRUE(drop)) indt[, (cols) := NULL]
  indt[]
}

flattenLong <- function(indt, cols) {
  ob <- setdiff(names(indt), cols)
  x <- flatten(indt, cols, TRUE)
  mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
  setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}

Here's one way to use it, applying it to all list columns.

flattenLong(df, names(df)[sapply(df, is.list)])
#               Name Gender  Culture                                     Born Alive variable
#   1: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        1
#   2: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        2
#   3: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        3
#   4: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        4
#   5: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        5
# ---                                                                                      
# 476:          Will   Male                                                   FALSE       12
# 477:          Will   Male                                                   FALSE       13
# 478:          Will   Male                                                   FALSE       14
# 479:          Will   Male                                                   FALSE       15
# 480:          Will   Male                                                   FALSE       16
#                                      Titles        Aliases           Allegiances                Books
#   1:              Priest of the Drowned God   The Damphair House Greyjoy of Pyke    A Game of Thrones
#   2: Captain of the Golden Storm (formerly) Aeron Damphair                    NA     A Clash of Kings
#   3:                                     NA             NA                    NA    A Storm of Swords
#   4:                                     NA             NA                    NA A Dance with Dragons
#   5:                                     NA             NA                    NA                   NA
# ---                                                                                                 
# 476:                                     NA             NA                    NA                   NA
# 477:                                     NA             NA                    NA                   NA
# 478:                                     NA             NA                    NA                   NA
# 479:                                     NA             NA                    NA                   NA
# 480:                                     NA             NA                    NA                   NA
#              POV_Books TV_Series         Actor
#   1: A Feast for Crows  Season 6 Michael Feast
#   2:                NA        NA            NA
#   3:                NA        NA            NA
#   4:                NA        NA            NA
#   5:                NA        NA            NA
# ---                                          
# 476:                NA        NA            NA
# 477:                NA        NA            NA
# 478:                NA        NA            NA
# 479:                NA        NA            NA
# 480:                NA        NA            NA

You could also do something like any of the following for dealing with a single column:

flattenLong(df[c(names(df)[!sapply(df, is.list)], "Books")], "Books")

flattenLong(df[c("Name", "Gender", "Culture", "Born", "Alive", "Books")], "Books")

df %>% 
  select(Name, Gender, Culture, Born, Alive, Books) %>%
  flattenLong("Books")

This isn't at all meant to be identical to the "tidyverse" approach. It handles NULL differently, and unnests each group to be the same length. Consider the following dataset:

mydf <- data.frame(V1 = c("a", "b", "c"), 
                   V2 = I(list(c(10, 20), NA_real_, c(20, 40, 60))), 
                   V3 = I(list(NULL, c("x", "y", "z"), c("BA", "BB"))))
mydf                   
#   V1         V2      V3
# 1  a     10, 20        
# 2  b         NA x, y, z
# 3  c 20, 40, 60  BA, BB

Difference #1: Number of values per group:

# Note the resulting number of values per group
# Equivalent of
# as.data.table(mydf)[, list(unlist(V2)), V1]
mydf %>% select(V1, V2) %>% unnest()
#   V1 V2
# 1  a 10
# 2  a 20
# 3  b NA
# 4  c 20
# 5  c 40
# 6  c 60

flattenLong(mydf[c("V1", "V2")], "V2")
#    V1 variable V2
# 1:  a     V2_1 10
# 2:  a     V2_2 20
# 3:  a     V2_3 NA
# 4:  b     V2_1 NA
# 5:  b     V2_2 NA
# 6:  b     V2_3 NA
# 7:  c     V2_1 20
# 8:  c     V2_2 40
# 9:  c     V2_3 60

Difference #2: Handling NULL values

mydf %>% select(V1, V3) %>% unnest()
# Error: Each column must either be a list of vectors or a list of data frames [V3]

flattenLong(mydf[c("V1", "V2")], "V2")
#    V1 variable V3
# 1:  a     V3_1 NA
# 2:  a     V3_2 NA
# 3:  a     V3_3 NA
# 4:  b     V3_1  x
# 5:  b     V3_2  y
# 6:  b     V3_3  z
# 7:  c     V3_1 BA
# 8:  c     V3_2 BB
# 9:  c     V3_3 NA

Upvotes: 3

Related Questions