Reputation: 13100
What's the tidy way of having tibble columns of class tibble (instead of class list or data.frame)?
It's clearly possible to have columns of class data.frame in tibbles (see example below), but none of the "tidy ways of data manipulation" (i.e. dplyr::mutate() or purrr::map*_df()) seem to work for me when trying to cast the columns to tibble instead of data.frame.
Current output of jsonlite::fromJSON():
# 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
Desired output:
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
data.frame columns can be very misleading
https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/
library(magrittr)
json <- '[
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 1,
"z": true
},
{
"x": "B",
"y": 2,
"z": false
}
]
}
},
"schema": "0.0.1"
},
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 10,
"z": false
},
{
"x": "B",
"y": 20,
"z": true
}
]
}
},
"schema": "0.0.1"
}
]'
When visualizing this, you'll see that there's a subtle but important distinction between objects (which map to data.frames) and arrays (which map to lists):
Parsing and casting to tibble:
x <- json %>%
jsonlite::fromJSON() %>%
tibble::as_tibble()
x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
So it's clearly possible to have columns that are of class data.frame.
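To see the object/array distinction in isolation, here is a minimal sketch on a made-up two-record JSON string. The field names obj and arr are hypothetical and not part of the example data, and the sketch assumes jsonlite's default simplification settings.

```r
# Hypothetical two-record JSON: `obj` is a JSON object, `arr` a JSON array
toy_json <- '[
  {"obj": {"a": 1}, "arr": ["u", "v"]},
  {"obj": {"a": 2}, "arr": ["w", "x"]}
]'

parsed <- jsonlite::fromJSON(toy_json)

# JSON objects end up as a nested data.frame column ...
is.data.frame(parsed$obj)

# ... while JSON arrays of scalars end up as a plain list column
is.list(parsed$arr) && !is.data.frame(parsed$arr)
```

This mirrors what str() shows above: levelOne (an object) is a data.frame column, while labels (an array) is a list column.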
Casting data.frame to tibble columns: "the bad way"
But I'd like tibbles instead of data frames, so let's try the only thing I got to work: explicitly re-assigning the respective list levels, or data frame/tibble columns, to be more precise:
# Make a copy so we don't mess with the initial state of `x`
y <- x
y$levelOne <- y$levelOne %>%
tibble::as_tibble()
y$levelOne$levelTwo <- y$levelOne$levelTwo %>%
tibble::as_tibble()
y$levelOne$levelTwo$levelThree <- y$levelOne$levelTwo$levelThree %>%
purrr::map(tibble::as_tibble)
y %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
That works, but is not in line with "tidy data manipulation pipes".
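The level-by-level re-assignment could also be generalized into a small recursive helper. The sketch below is not from the original post: nested_df_to_tibble() is a hypothetical name, and the toy input is made up to mirror the shape of the example data. It relies only on base R assignment plus tibble::as_tibble().

```r
# Hypothetical helper (an assumption, not the post's method): recursively
# convert every nested data.frame to a tibble, walking both data.frame
# columns and list columns.
nested_df_to_tibble <- function(x) {
  if (is.data.frame(x)) {
    # Convert each column first, then the data frame itself
    for (nm in names(x)) {
      x[[nm]] <- nested_df_to_tibble(x[[nm]])
    }
    tibble::as_tibble(x)
  } else if (is.list(x)) {
    # Plain list (e.g. a list column): convert each element
    lapply(x, nested_df_to_tibble)
  } else {
    # Atomic vectors are returned unchanged
    x
  }
}

# Made-up input with a data.frame column holding a list column of data.frames
toy <- jsonlite::fromJSON(
  '[{"outer": {"inner": [{"v": 1}, {"v": 2}]}, "id": "a"},
    {"outer": {"inner": [{"v": 3}, {"v": 4}]}, "id": "b"}]'
)

toy_tbl <- nested_df_to_tibble(toy)
str(toy_tbl)
```

This is still not a pipe-friendly dplyr verb, but it avoids spelling out each nesting level by hand.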
Casting data.frame to tibble columns: "the better way" (trying and failing)
# Yet another copy so we can compare:
z <- x
# Just to check that this works
z$levelOne %>%
tibble::as_tibble()
# # A tibble: 2 x 1
# levelTwo$levelThree
# <list>
# 1 <df[,3] [2 × 3]>
# 2 <df[,3] [2 × 3]>
# Trying to get this to work with `dplyr::mutate()` fails:
z %>%
dplyr::mutate(levelOne = levelOne %>%
tibble::as_tibble()
)
# Error: Column `levelOne` is of unsupported class data.frame
z %>%
dplyr::transmute(levelOne = levelOne %>%
tibble::as_tibble()
)
# Error: Column `levelOne` is of unsupported class data.frame
# Same goes for `{purrr}`:
z %>%
dplyr::mutate(levelOne = levelOne %>%
purrr::map_df(tibble::as_tibble)
)
# Error: Column `levelOne` is of unsupported class data.frame
z %>%
tibble::add_column(levelOne = z$levelOne %>% tibble::as_tibble())
# Error: Can't add duplicate columns with `add_column()`:
# * Column `levelOne` already exists in `.data`.
# Works, but not what I want:
z %>%
tibble::add_column(test = z$levelOne %>% tibble::as_tibble()) %>%
str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 4 variables:
# [...]
# $ test :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
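Wrapping tibble::as_tibble() in purrr::map() seems to work, but the result is clearly not what we want, as we duplicate everything below levelOne (compare to the desired output above). The likely reason is that purrr::map() iterates over the columns of levelOne, so it returns a list with a single element that mutate() then recycles across both rows: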
# Works, but not what I want:
z_new <- z %>%
dplyr::mutate(levelOne = levelOne %>%
purrr::map(tibble::as_tibble)
)
z_new %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 2
# ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
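For what it's worth, the "unsupported class data.frame" errors above depend on the dplyr version in use. As an assumption that has not been checked against the setup in this question: dplyr 1.0.0 and later support data frame columns, so a named data frame result inside mutate() is stored as a nested column. A minimal sketch on made-up data (the names id and nested are hypothetical):

```r
# Assumption: dplyr >= 1.0.0, which supports data frame columns in mutate()
toy <- tibble::as_tibble(
  jsonlite::fromJSON('[{"id": "a", "nested": {"v": 1}},
                       {"id": "b", "nested": {"v": 2}}]')
)

# Before: `nested` is a plain data.frame column
class(toy$nested)

# Cast the nested column to a tibble inside a pipe-friendly verb
out <- dplyr::mutate(toy, nested = tibble::as_tibble(nested))

# After: `nested` carries the tibble classes
class(out$nested)
```

Note that this only converts one level; deeper levels would still need their own conversion.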
Got it to work with Hendrik's help!
Still, IMO this topic raises some interesting follow-up questions regarding whether or not one should - or even could - do it any other way if the primary goal is to end up with tidy nested tibbles that play nicely with tidyr::unnest() and tidyr::nest() (see comments in Hendrik's answer below).
As to the proposed approach in https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/: I might be overlooking something obvious, but I think it only works for JSON docs with a single document.
First, let's modify df_to_tibble() (see Hendrik's answer below) to only turn "leaf" data frames into tibbles while turning "branch" data frames into lists:
leaf_df_to_tibble <- function(x) {
if (is.data.frame(x)) {
if (!any(purrr::map_lgl(x, is.list))) {
# Only captures "leaf" DFs:
tibble::as_tibble(x)
} else {
as.list(x)
}
} else {
x
}
}
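To make the leaf/branch distinction concrete, here is a small sketch on two made-up data frames (leaf and branch are hypothetical names). The function is restated in condensed form only so the snippet runs on its own.

```r
# Restated from above (condensed) so the snippet is self-contained
leaf_df_to_tibble <- function(x) {
  if (!is.data.frame(x)) return(x)
  if (!any(purrr::map_lgl(x, is.list))) tibble::as_tibble(x) else as.list(x)
}

# Hypothetical "leaf": only atomic columns
leaf <- data.frame(x = c("A", "B"), y = 1:2)

# Hypothetical "branch": one column is itself a data.frame
branch <- data.frame(id = 1:2)
branch$inner <- data.frame(v = 3:4)

leaf_out <- leaf_df_to_tibble(leaf)
branch_out <- leaf_df_to_tibble(branch)

class(leaf_out)    # tibble classes
class(branch_out)  # "list"
```

A data frame counts as a "branch" as soon as any of its columns is a list, which includes nested data frames because is.list() is TRUE for them.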
This would give us results that are in line with the proposed way in the blog post, but only for "single object" JSON docs, as illustrated below:
df <- json %>% jsonlite::fromJSON()
# Only take the first object from the parsed JSON:
df_subset <- df[1, ]
Transforming df_subset:
df_subset_tibble <- purrr::reduce(
0:purrr::vec_depth(df_subset),
function(x, depth) {
purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
},
.init = df_subset
) %>%
tibble::as_tibble()
df_subset_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 3 variables:
# $ labels :List of 1
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 1
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 1
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# $ schema : chr "0.0.1"
Transforming df:
df_tibble <- purrr::reduce(
0:purrr::vec_depth(df),
function(x, depth) {
purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
},
.init = df
) %>%
tibble::as_tibble()
df_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 2
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
As we see, "listifying" nested JSON structures may actually result in copying the "leaves". It just doesn't jump out at you as long as n = 1 (number of JSON docs), but strikes you as soon as n > 1.
Upvotes: 2
Views: 773
Reputation: 459
The comments above raise some valid points. Still, I do believe there is a way to achieve what you're after (whether or not this is a particularly good idea is less clear) by leveraging three functions from the purrr package in combination:
- purrr::vec_depth allows us to get the (nesting) depth of a given list,
- purrr::modify_depth allows us to apply a function to a list at the specified level of depth, and
- purrr::reduce allows us to iteratively apply a function and have the result of each iteration be passed as the input to the subsequent iteration.
In essence, we want to convert any data.frame found at any level in the list to a tibble. This can easily be achieved using several rounds of purrr::modify_depth where we simply alter the depth depending on the level of the list we wish to target. Crucially, however, we want to do this in a way so that changes to level 1, for example, are retained when we move on to targeting level 2; changes to level 1 and 2 are retained when we move on to level 3; and so on. This is where purrr::reduce comes in: each time we apply purrr::modify_depth to convert a data.frame to a tibble, we'll ensure that the resultant output gets passed as the input to the next iteration. This is illustrated in the MWE below.
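Before the full example, the accumulator-passing behaviour of purrr::reduce() can be seen in isolation. The sketch below uses a made-up list of arithmetic steps (steps is a hypothetical name) in place of the per-depth conversions; each step receives the output of the previous one, starting from .init.

```r
# Made-up sequence of steps standing in for the per-depth conversions
steps <- list(
  function(x) x + 1,
  function(x) x * 2,
  function(x) x - 3
)

# reduce() threads the accumulator `acc` through every step, starting at .init
result <- purrr::reduce(
  steps,
  function(acc, f) f(acc),
  .init = 5
)

result  # ((5 + 1) * 2) - 3 = 9
```

In the routine described above, the accumulator is the nested data structure and each step is one purrr::modify_depth() call at a given depth.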
Start with the basic setup of data structures and libraries
#> Load libraries ----
library(tidyverse)
json <- '[
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 1,
"z": true
},
{
"x": "B",
"y": 2,
"z": false
}
]
}
},
"schema": "0.0.1"
},
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 10,
"z": false
},
{
"x": "B",
"y": 20,
"z": true
}
]
}
},
"schema": "0.0.1"
}
]'
# convert json to a nested data.frame
df <- jsonlite::fromJSON(json)
Now we'll create a simple helper function that can conditionally convert data.frame to tibble:
# define a simple function to convert data.frame to tibble
df_to_tibble <- function(x) {
if (is.data.frame(x)) as_tibble(x) else x
}
Now for the crucial routine: Taking df as the initial starting point (.init = df), apply the df_to_tibble function at each level of df (0:purrr::vec_depth(df)) using purrr::modify_depth. Use purrr::reduce to ensure that the results from each individual iteration get passed as the input to the subsequent iteration.
# create df_tibble by reducing the result of applying df_to_tibble to each level
# of df via purrr's modify_depth function; lastly, ensure that the top level
# data.frame is also converted to a tibble
df_tibble <- purrr::reduce(
0:purrr::vec_depth(df),
function(x, depth) {
purrr::modify_depth(x, depth, df_to_tibble, .ragged = TRUE)
},
.init = df
) %>%
as_tibble()
# show the structure of df_tibble
str(df_tibble)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 3 variables:
#> $ labels :List of 2
#> ..$ : chr "label-a" "label-b"
#> ..$ : chr "label-a" "label-b"
#> $ levelOne:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 1 variable:
#> ..$ levelTwo:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 1 variable:
#> .. ..$ levelThree:List of 2
#> .. .. ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 3 variables:
#> .. .. .. ..$ x: chr "A" "B"
#> .. .. .. ..$ y: int 1 2
#> .. .. .. ..$ z: logi TRUE FALSE
#> .. .. ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 3 variables:
#> .. .. .. ..$ x: chr "A" "B"
#> .. .. .. ..$ y: int 10 20
#> .. .. .. ..$ z: logi FALSE TRUE
#> $ schema : chr "0.0.1" "0.0.1"
Upvotes: 1