Rappster

Reputation: 13100

Tibble columns of class tibble instead of class data frame

What's the tidy way of having tibble columns of class tibble (instead of class list or data.frame)?

It's clearly possible to have columns of class data.frame in tibbles (see example below), but none of the "tidy ways of data manipulation" (i.e. dplyr::mutate() or purrr::map*_df()) seems to work for me when trying to cast those columns to tibble instead of data.frame.

Current output of jsonlite::fromJSON()

# 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

Desired result

# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

Why having data.frame columns can be very misleading

https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/


Example

Example data

library(magrittr)

json <- '[
  {
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 1,
            "z": true
          },
          {
            "x": "B",
            "y": 2,
            "z": false
          }
          ]
      }
    },
    "schema": "0.0.1"
  },
  {
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 10,
            "z": false
          },
          {
            "x": "B",
            "y": 20,
            "z": true
          }
          ]
      }
    },
    "schema": "0.0.1"
  }
]'

When visualizing this JSON, you'll see that there's a subtle but important distinction between objects (which map to data.frames) and arrays (which map to lists).

Parsing JSON and converting to tibble

x <- json %>% 
  jsonlite::fromJSON() %>% 
  tibble::as_tibble()

x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

So it's clearly possible to have columns that are of class data.frame.
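To make the object/array distinction concrete, here is a minimal base R sketch on a small hand-built frame (the data and the name `col_kind` are made up for illustration, not taken from the parsed JSON). One thing to watch: `is.list()` is also TRUE for data frames, so the `is.data.frame()` check has to come first.

```r
# Hand-built stand-in for the parsed JSON (assumed data, for illustration only)
df <- data.frame(schema = c("0.0.1", "0.0.1"))
df$levelOne <- data.frame(id = 1:2)                 # data frame column (JSON object)
df$labels <- list(c("label-a", "label-b"),
                  c("label-a", "label-b"))          # list column (JSON array)

# Classify each column; check is.data.frame() before is.list(),
# because a data frame is also a list
col_kind <- vapply(df, function(col) {
  if (is.data.frame(col)) {
    "data.frame"
  } else if (is.list(col)) {
    "list"
  } else {
    "atomic"
  }
}, character(1))

col_kind
```

The result is a named character vector with one entry per top-level column, which makes it easy to spot which columns are nested data frames.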

Casting data.frame to tibble columns: "the bad way"

But I'd like tibbles instead of data frames, so let's try the only thing I got to work: explicitly re-assigning the respective list levels (or, to be more precise, the data frame/tibble columns):

# Make a copy so we don't mess with the initial state of `x`
y <- x

y$levelOne <- y$levelOne %>% 
  tibble::as_tibble()
y$levelOne$levelTwo <- y$levelOne$levelTwo %>% 
  tibble::as_tibble()
y$levelOne$levelTwo$levelThree <- y$levelOne$levelTwo$levelThree %>% 
  purrr::map(tibble::as_tibble)

y %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

That works, but is not in line with "tidy data manipulation pipes".
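One way to avoid the level-by-level re-assignment is a small recursive helper that can sit in a pipe. This is only a sketch: `as_tibble_deep()` is a hypothetical name (not part of tibble or any other package), and the input is rebuilt by hand in base R rather than parsed from the JSON above.

```r
library(tibble)

# Hypothetical helper: rebuild every data frame, at any depth, as a tibble
as_tibble_deep <- function(x) {
  if (is.data.frame(x)) {
    # Convert the columns first; unclass() exposes them as a plain named list
    cols <- lapply(unclass(x), as_tibble_deep)
    tibble::new_tibble(cols, nrow = nrow(x))
  } else if (is.list(x)) {
    # Plain list (e.g. a list column): convert each element
    lapply(x, as_tibble_deep)
  } else {
    x
  }
}

# Hand-built stand-in with the same nesting as the example (assumed data)
leaf_a <- data.frame(x = c("A", "B"), y = 1:2)
leaf_b <- data.frame(x = c("A", "B"), y = c(10L, 20L))
level_two <- data.frame(row.names = 1:2)
level_two$levelThree <- list(leaf_a, leaf_b)
level_one <- data.frame(row.names = 1:2)
level_one$levelTwo <- level_two
df <- data.frame(schema = c("0.0.1", "0.0.1"))
df$levelOne <- level_one

out <- as_tibble_deep(df)
str(out)
```

Because the helper takes the data as its first argument, it should also work as a pipe step, e.g. `x %>% as_tibble_deep()`.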

Casting data.frame to tibble columns: "the better way" (trying and failing)

# Yet another copy so we can compare:
z <- x

# Just to check that this works
z$levelOne %>% 
    tibble::as_tibble()
# # A tibble: 2 x 1
#   levelTwo$levelThree
#   <list>             
# 1 <df[,3] [2 × 3]>   
# 2 <df[,3] [2 × 3]>   

# Trying to get this to work with `dplyr::mutate()` fails:
z %>% 
  dplyr::mutate(levelOne = levelOne %>% 
    tibble::as_tibble()
  )
# Error: Column `levelOne` is of unsupported class data.frame

z %>% 
  dplyr::transmute(levelOne = levelOne %>% 
    tibble::as_tibble()
  )
# Error: Column `levelOne` is of unsupported class data.frame

# Same goes for `{purrr}`:
z %>% 
  dplyr::mutate(levelOne = levelOne %>% 
    purrr::map_df(tibble::as_tibble)
  )
# Error: Column `levelOne` is of unsupported class data.frame

z %>% 
  tibble::add_column(levelOne = z$levelOne %>% tibble::as_tibble())
# Error: Can't add duplicate columns with `add_column()`:
# * Column `levelOne` already exists in `.data`.

# Works, but not what I want:
z %>% 
  tibble::add_column(test = z$levelOne %>% tibble::as_tibble()) %>% 
  str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  4 variables:
#  [...]
#  $ test    :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
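
For what it's worth, the `unsupported class data.frame` error looks like it comes from an older dplyr. As far as I know, dplyr 1.0.0 and later accept data frame columns in mutate(). A minimal sketch, assuming dplyr >= 1.0.0 and a made-up one-level example (not the full nested structure):

```r
library(dplyr)
library(tibble)

# Hand-built stand-in: one data frame column, one level deep (assumed data)
df <- data.frame(schema = c("0.0.1", "0.0.1"))
df$levelOne <- data.frame(y = 1:2)
z <- tibble::as_tibble(df)

# Assumes dplyr >= 1.0.0; older versions raise the error shown above
out <- dplyr::mutate(z, levelOne = tibble::as_tibble(levelOne))
class(out$levelOne)
```

Note that this only converts the column it is pointed at; anything nested further down would still need its own conversion.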

The only thing that worked (is not what we want)

Wrapping tibble::as_tibble() in purrr::map() seems to work, but the result is clearly not what we want, as it duplicates everything below levelOne (compare to the desired output above):

# Works, but not what I want:
z_new <- z %>% 
  dplyr::mutate(levelOne = levelOne %>% 
    purrr::map(tibble::as_tibble)
  )

z_new %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:List of 2
#   ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#   ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

EDIT (follow-up investigation)

Got it to work with Hendrik's help!

Still, IMO this topic raises some interesting follow-up questions regarding whether or not one should - or even could - do it any other way if the primary goal is to end up with tidy nested tibbles that play nicely with tidyr::unnest() and tidyr::nest() (see comments in Hendrik's answer below).

As to the proposed approach in https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/: I might be overlooking something obvious, but I think it only works for JSON input that contains a single document.

First, let's modify df_to_tibble() (see Hendrik's answer below) to only turn "leaf" data frames into tibbles while turning "branch" data frames into lists:

leaf_df_to_tibble <- function(x) {
  if (is.data.frame(x)) {
    if (!any(purrr::map_lgl(x, is.list))) { 
      # Only captures "leaf" DFs:
      tibble::as_tibble(x) 
    } else {
      as.list(x)
    }
  } else {
    x
  }
}

This would give us results that are in line with the approach proposed in the blog post, but only for "single object" JSON docs, as illustrated below.

df <- json %>% jsonlite::fromJSON()

# Only take the first object from the parsed JSON:
df_subset <- df[1, ]

Transforming df_subset:

df_subset_tibble <- purrr::reduce(
  0:purrr::vec_depth(df_subset),
  function(x, depth) {
    purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
  }, 
  .init = df_subset
) %>% 
  tibble::as_tibble()

df_subset_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of  3 variables:
#  $ labels  :List of 1
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:List of 1
#   ..$ levelTwo:List of 1
#   .. ..$ levelThree:List of 1
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#  $ schema  : chr "0.0.1"

Transforming df:

df_tibble <- purrr::reduce(
  0:purrr::vec_depth(df),
  function(x, depth) {
    purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
  }, 
  .init = df
) %>% 
  tibble::as_tibble()

df_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:List of 2
#   ..$ levelTwo:List of 1
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#   ..$ levelTwo:List of 1
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

As we can see, "listifying" nested JSON structures may actually result in copying the "leaves". It doesn't jump out at you as long as n = 1 (number of JSON docs), but it becomes obvious as soon as n > 1.
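
A likely explanation for the copying, sketched in base R with made-up data: as.list() on a data frame returns one element per column, not one per row. A "branch" with two rows and one column therefore becomes a list of length 1, which then presumably gets recycled to fill both rows of the outer tibble.

```r
# Hand-built "branch" data frame: 2 rows (one per JSON doc), 1 list column
branch <- data.frame(row.names = 1:2)
branch$leaf <- list(data.frame(y = 1:2), data.frame(y = c(10L, 20L)))

nrow(branch)                 # 2: one row per JSON document
as_list <- as.list(branch)
length(as_list)              # 1: one element per column

# Recycling a length-1 list to 2 rows repeats the same element in each row
recycled <- rep(as_list, length.out = nrow(branch))
identical(recycled[[1]], recycled[[2]])  # TRUE
```

With a single document the branch has one row, so the length-1 list already matches and nothing visibly repeats.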

Upvotes: 2

Views: 773

Answers (1)

hendrikvanb

Reputation: 459

Background

The comments above raise some valid points. Still, I do believe there is a way to achieve what you're after (whether or not this is a particularly good idea is less clear) by leveraging three functions from the purrr package in combination:

  1. purrr::vec_depth allows us to get the (nesting) depth of a given list,
  2. purrr::modify_depth allows us to apply a function to a list at the specified level of depth, and
  3. purrr::reduce allows us to iteratively apply a function and have the result of each iteration be passed as the input to the subsequent iteration.

General approach

In essence, we want to convert any data.frame found at any level in the list to a tibble. This can easily be achieved using several rounds of purrr::modify_depth where we simply alter the depth depending on the level of the list we wish to target. Crucially, however, we want to do this in a way so that changes to level 1, for example, are retained when we move on to targeting level 2; changes to levels 1 and 2 are retained when we move on to level 3; and so on. This is where purrr::reduce comes in: each time we apply purrr::modify_depth to convert a data.frame to a tibble, we'll ensure that the resulting output gets passed as the input to the next iteration. This is illustrated in the MWE below.
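
Before the full MWE, the reduce/modify_depth interplay can be seen on a toy list. This is just a sketch with made-up data: `nested` and `times_ten` are hypothetical names, and the depths are hard-coded rather than computed.

```r
library(purrr)

# Toy nested list (assumed data): numbers sit at depth 2
nested <- list(a = list(1, 2), b = list(3))

# Conditional function, analogous to a "convert only if data.frame" helper
times_ten <- function(v) if (is.numeric(v)) v * 10 else v

# Each pass targets one depth; reduce() feeds the result into the next pass
result <- reduce(
  1:2,
  function(acc, depth) modify_depth(acc, depth, times_ten, .ragged = TRUE),
  .init = nested
)

str(result)
```

The pass at depth 1 sees only the inner lists and leaves them alone; the pass at depth 2 reaches the numbers and multiplies them, working on the output of the first pass.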

MWE

Start with the basic setup of data structures and libraries

#> Load libraries ----
library(tidyverse)

json <- '[
  {
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 1,
            "z": true
          },
          {
            "x": "B",
            "y": 2,
            "z": false
          }
          ]
      }
    },
    "schema": "0.0.1"
  },
  {
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 10,
            "z": false
          },
          {
            "x": "B",
            "y": 20,
            "z": true
          }
          ]
      }
    },
    "schema": "0.0.1"
  }
]'  

# convert json to a nested data.frame
df <- jsonlite::fromJSON(json)

Now we'll create a simple helper function that can conditionally convert data.frame to tibble

# define a simple function to convert data.frame to tibble
df_to_tibble <- function(x) {
  if (is.data.frame(x)) as_tibble(x) else x
}

Now for the crucial routine: taking df as the initial starting point (.init = df), apply the df_to_tibble function at each level of df (0:purrr::vec_depth(df)) using purrr::modify_depth. Use purrr::reduce to ensure that the result of each iteration gets passed as the input to the subsequent iteration.

# create df_tibble by reducing the result of applying df_to_tibble to each level
# of df via purrr's modify_depth function; lastly, ensure that the top-level
# data.frame is also converted to a tibble
df_tibble <- purrr::reduce(
  0:purrr::vec_depth(df),
  function(x, depth) {
    purrr::modify_depth(x, depth, df_to_tibble, .ragged = TRUE)
  }, 
  .init = df
) %>% 
  as_tibble()
# show the structure of df_tibble
str(df_tibble)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  3 variables:
#>  $ labels  :List of 2
#>   ..$ : chr  "label-a" "label-b"
#>   ..$ : chr  "label-a" "label-b"
#>  $ levelOne:Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  1 variable:
#>   ..$ levelTwo:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
#>   .. ..$ levelThree:List of 2
#>   .. .. ..$ :Classes 'tbl_df', 'tbl' and 'data.frame':   2 obs. of  3 variables:
#>   .. .. .. ..$ x: chr  "A" "B"
#>   .. .. .. ..$ y: int  1 2
#>   .. .. .. ..$ z: logi  TRUE FALSE
#>   .. .. ..$ :Classes 'tbl_df', 'tbl' and 'data.frame':   2 obs. of  3 variables:
#>   .. .. .. ..$ x: chr  "A" "B"
#>   .. .. .. ..$ y: int  10 20
#>   .. .. .. ..$ z: logi  FALSE TRUE
#>  $ schema  : chr  "0.0.1" "0.0.1"

Upvotes: 1
