Reputation: 886
I want to gather() list columns to create new rows in my data frame. I'm using the Game of Thrones data set in the repurrrsive package. Below is my code to set up the problem:
library(tidyverse)
got_chars <- repurrrsive::got_chars
df <- got_chars %>%
{
tibble::tibble(
Name = map_chr(., 'name'),
Gender = map_chr(.,'gender'),
Culture = map_chr(.,'culture'),
Born = map_chr(.,'born'),
Alive = map_chr(.,'alive'),
Titles = map(.,'titles'),
Aliases = map(., "aliases"),
Allegiances = map(., "allegiances"),
Books = map(.,'books'),
POV_Books = map(.,'povBooks'),
TV_Series = map(.,'tvSeries'),
Actor = map(.,'playedBy')
)
}
What I would like to be able to do, but can't figure out is to gather()
the list columns (e.g. Books, POV_Books, etc.) in order to create a new row for each record. For example:
Name | Book
Theon Greyjoy | A Game of Thrones
Theon Greyjoy | A Storm of Swords
Theon Greyjoy | A Feast for Crows
The closest I've been able to get is:
df_books <- df %>%
separate_rows(Books,sep="\"")
This will work, but leaves a trail of garbage behind from the c() characters within the vectors. I can filter those out, but I feel like there's a much better way and I might just not be trying the correct functions. Any suggestions would be much appreciated, thanks!
Upvotes: 3
Views: 3345
Reputation: 47350
You can use unnest
, but first you have to format the column in a way tidyr
understands.
This means:
NULL
elements in the columns you want to unnestdata.frames
and not vectors
.
library(tidyverse)
df %>%
select(Name,Books) %>% # skip this line to keep all columns
slice(which(lengths(Books)>0)) %>%
mutate(Books = map(Books,~tibble(Book=.x))) %>%
unnest(Books)
# # A tibble: 77 x 2
# Name Book
# <chr> <chr>
# 1 Theon Greyjoy A Game of Thrones
# 2 Theon Greyjoy A Storm of Swords
# 3 Theon Greyjoy A Feast for Crows
# 4 Tyrion Lannister A Feast for Crows
# 5 Tyrion Lannister The World of Ice and Fire
# 6 Victarion Greyjoy A Game of Thrones
# 7 Victarion Greyjoy A Clash of Kings
# 8 Victarion Greyjoy A Storm of Swords
# 9 Will A Clash of Kings
# 10 Areo Hotah A Game of Thrones
# # ... with 67 more rows
Your tried solution works fine though if we filter the output (same output as my solution):
df %>%
select(Name, Books) %>%
separate_rows(Books,sep="\"") %>%
filter(!Books %in% c("c(",", ",")") & lengths(Books)>0)
Upvotes: 2
Reputation: 193677
Your tibble currently looks like this:
df
# # A tibble: 30 x 12
# Name Gender Culture Born Alive Titles Aliases Allegiances Books POV_Books TV_Series Actor
# <chr> <chr> <chr> <chr> <chr> <list> <list> <list> <lis> <list> <list> <lis>
# 1 Theon Greyjoy Male Ironbo… In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [2]> <chr [6]> <chr…
# 2 Tyrion Lannister Male "" In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [4]> <chr [6]> <chr…
# 3 Victarion Greyjoy Male Ironbo… In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [2]> <chr [1]> <chr…
# 4 Will Male "" "" FALSE <chr … <chr [… <NULL> <chr… <chr [1]> <chr [1]> <chr…
# 5 Areo Hotah Male Norvos… In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [2]> <chr [2]> <chr…
# 6 Chett Male "" At H… FALSE <chr … <chr [… <NULL> <chr… <chr [1]> <chr [1]> <chr…
# 7 Cressen Male "" In 2… FALSE <chr … <chr [… <NULL> <chr… <chr [1]> <chr [1]> <chr…
# 8 Arianne Martell Female Dornish In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [1]> <chr [1]> <chr…
# 9 Daenerys Targaryen Female Valyri… In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [4]> <chr [6]> <chr…
# 10 Davos Seaworth Male Wester… In 2… TRUE <chr … <chr [… <chr [2]> <chr… <chr [3]> <chr [5]> <chr…
# # ... with 20 more rows
unnest()
would be an obvious choice, but doesn't work if all of the lists are the same in terms of how many values they would expand to.
library(tidyverse)
unnest(df)
# Error: All nested columns must have the same number of elements.
One approach would be to use the following functions. flatten()
makes the data "wide", and flattenLong()
takes the "wide" data and makes it "long". The assumption that has been made about missing data is that if a vector in a list item is shorter than the matching vector in another list item, the missing data are last.
flatten <- function(indt, cols, drop = FALSE) {
require(data.table)
if (!is.data.table(indt)) indt <- as.data.table(indt)
x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
nams <- paste(rep(cols, x), sequence(x), sep = "_")
indt[, (nams) := unlist(lapply(.SD, data.table::transpose), recursive = FALSE), .SDcols = (cols)]
if (isTRUE(drop)) indt[, (cols) := NULL]
indt[]
}
flattenLong <- function(indt, cols) {
ob <- setdiff(names(indt), cols)
x <- flatten(indt, cols, TRUE)
mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}
Here's one way to use it, applying it to all list
columns.
flattenLong(df, names(df)[sapply(df, is.list)])
# Name Gender Culture Born Alive variable
# 1: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 1
# 2: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 2
# 3: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 3
# 4: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 4
# 5: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 5
# ---
# 476: Will Male FALSE 12
# 477: Will Male FALSE 13
# 478: Will Male FALSE 14
# 479: Will Male FALSE 15
# 480: Will Male FALSE 16
# Titles Aliases Allegiances Books
# 1: Priest of the Drowned God The Damphair House Greyjoy of Pyke A Game of Thrones
# 2: Captain of the Golden Storm (formerly) Aeron Damphair NA A Clash of Kings
# 3: NA NA NA A Storm of Swords
# 4: NA NA NA A Dance with Dragons
# 5: NA NA NA NA
# ---
# 476: NA NA NA NA
# 477: NA NA NA NA
# 478: NA NA NA NA
# 479: NA NA NA NA
# 480: NA NA NA NA
# POV_Books TV_Series Actor
# 1: A Feast for Crows Season 6 Michael Feast
# 2: NA NA NA
# 3: NA NA NA
# 4: NA NA NA
# 5: NA NA NA
# ---
# 476: NA NA NA
# 477: NA NA NA
# 478: NA NA NA
# 479: NA NA NA
# 480: NA NA NA
You could also do something like any of the following for dealing with a single column:
flattenLong(df[c(names(df)[!sapply(df, is.list)], "Books")], "Books")
flattenLong(df[c("Name", "Gender", "Culture", "Born", "Alive", "Books")], "Books")
df %>%
select(Name, Gender, Culture, Born, Alive, Books) %>%
flattenLong("Books")
This isn't at all meant to be identical to the "tidyverse" approach. It handles NULL
differently, and unnest
s each group to be the same length. Consider the following dataset:
mydf <- data.frame(V1 = c("a", "b", "c"),
V2 = I(list(c(10, 20), NA_real_, c(20, 40, 60))),
V3 = I(list(NULL, c("x", "y", "z"), c("BA", "BB"))))
mydf
# V1 V2 V3
# 1 a 10, 20
# 2 b NA x, y, z
# 3 c 20, 40, 60 BA, BB
Difference #1: Number of values per group:
# Note the resulting number of values per group
# Equivalent of
# as.data.table(mydf)[, list(unlist(V2)), V1]
mydf %>% select(V1, V2) %>% unnest()
# V1 V2
# 1 a 10
# 2 a 20
# 3 b NA
# 4 c 20
# 5 c 40
# 6 c 60
flattenLong(mydf[c("V1", "V2")], "V2")
# V1 variable V2
# 1: a V2_1 10
# 2: a V2_2 20
# 3: a V2_3 NA
# 4: b V2_1 NA
# 5: b V2_2 NA
# 6: b V2_3 NA
# 7: c V2_1 20
# 8: c V2_2 40
# 9: c V2_3 60
Difference #2: Handling NULL
values
mydf %>% select(V1, V3) %>% unnest()
# Error: Each column must either be a list of vectors or a list of data frames [V3]
flattenLong(mydf[c("V1", "V2")], "V2")
# V1 variable V3
# 1: a V3_1 NA
# 2: a V3_2 NA
# 3: a V3_3 NA
# 4: b V3_1 x
# 5: b V3_2 y
# 6: b V3_3 z
# 7: c V3_1 BA
# 8: c V3_2 BB
# 9: c V3_3 NA
Upvotes: 3