Reputation: 10173
Here is 1 row of data we are fetching from a sports API that comes into us as a nested list. Our fetch_results$data
is a list with a nested lists like this for each of many games, as this data is for many soccer matches. The list-of-list nesting can go 3-4 layers deep, with inner lists for scores
, and time
, and visitorTeam
below, and more.
> dput(fetch_results$data[1])
list(list(id = 11984409L, league_id = 1326L, season_id = 15733L,
stage_id = 77442469L, round_id = 186274L, group_id = 225400L,
aggregate_id = NULL, venue_id = 7189L, referee_id = NULL,
localteam_id = 18716L, visitorteam_id = 18658L, winner_team_id = NULL,
weather_report = NULL, commentaries = FALSE, attendance = NULL,
pitch = NULL, details = "Match 1", neutral_venue = FALSE,
winning_odds_calculated = FALSE, formations = list(localteam_formation = NULL,
visitorteam_formation = NULL), scores = list(localteam_score = 0L,
visitorteam_score = 0L, localteam_pen_score = NULL, visitorteam_pen_score = NULL,
ht_score = NULL, ft_score = NULL, et_score = NULL, ps_score = NULL),
time = list(status = "NS", starting_at = list(date_time = "2021-06-11 19:00:00",
date = "2021-06-11", time = "19:00:00", timestamp = 1623438000L,
timezone = "UTC"), minute = NULL, second = NULL, added_time = NULL,
extra_minute = NULL, injury_time = NULL), coaches = list(
localteam_coach_id = 455836L, visitorteam_coach_id = 784486L),
standings = list(localteam_position = 3L, visitorteam_position = 1L),
assistants = list(first_assistant_id = NULL, second_assistant_id = NULL,
fourth_official_id = NULL), leg = "1/1", colors = NULL,
deleted = FALSE, is_placeholder = FALSE, localTeam = list(
data = list(id = 18716L, legacy_id = 213L, name = "Turkey",
short_code = "TUR", twitter = NULL, country_id = 404L,
national_team = TRUE, founded = 1923L, logo_path = "https://cdn.sportmonks.com/images//soccer/teams/28/18716.png",
venue_id = 9634L, current_season_id = 15733L, is_placeholder = NULL)),
visitorTeam = list(data = list(id = 18658L, legacy_id = 205L,
name = "Italy", short_code = "ITA", twitter = NULL, country_id = 251L,
national_team = TRUE, founded = 1898L, logo_path = "https://cdn.sportmonks.com/images//soccer/teams/2/18658.png",
venue_id = 7189L, current_season_id = 15733L, is_placeholder = NULL))))
To flatten into a dataframe, we use:
zed <- fetch_results$data %>%
purrr::map(unlist) %>%
purrr::map(t) %>%
purrr::map(as_tibble) %>%
dplyr::bind_rows() %>%
readr::type_convert()
One row of our dataframe output looks like this:
If you look closer at the list of lists, there are many objects with value NULL that are dropped in the main dataframe. The entire score
list with all of its keys is dropped. According to this stackoverflow post, it looks like unlist() drops the NULL values is the culprit...
The solutions posted in that thread only address NULL values that are 1-layer nested deep, however the list above has many nested lists which you can see if you search for list( above.
What is the best way to flatten this list of lists without dropping any of the columns with NULL values? If the best way is to first replace NULLs with NAs, what is the best way to do this? Our existing code does the flattening and gets close, but doesn't keep the columns with NULLs.
Upvotes: 2
Views: 629
Reputation: 887173
An option is to convert to NA
before we do anything. This can be done in a recursive way with rrapply
library(rrapply)
library(purrr)
library(dplyr)
zed <- rrapply(fetch_results$data, f = function(x)
replace(x, is.null(x), NA)) %>%
map(unlist) %>%
map(t) %>%
map(as_tibble) %>%
bind_rows() %>%
type.convert(as.is = TRUE)
-output
# A tibble: 1 x 75
id league_id season_id stage_id round_id group_id aggregate_id venue_id referee_id localteam_id visitorteam_id winner_team_id weather_report
<int> <int> <int> <int> <int> <int> <lgl> <int> <lgl> <int> <int> <lgl> <lgl>
1 11984409 1326 15733 77442469 186274 225400 NA 7189 NA 18716 18658 NA NA
# … with 62 more variables: commentaries <lgl>, attendance <lgl>, pitch <lgl>, details <chr>, neutral_venue <lgl>, winning_odds_calculated <lgl>,
# formations.localteam_formation <lgl>, formations.visitorteam_formation <lgl>, scores.localteam_score <int>, scores.visitorteam_score <int>,
# scores.localteam_pen_score <lgl>, scores.visitorteam_pen_score <lgl>, scores.ht_score <lgl>, scores.ft_score <lgl>, scores.et_score <lgl>,
# scores.ps_score <lgl>, time.status <chr>, time.starting_at.date_time <chr>, time.starting_at.date <chr>, time.starting_at.time <chr>,
# time.starting_at.timestamp <int>, time.starting_at.timezone <chr>, time.minute <lgl>, time.second <lgl>, time.added_time <lgl>, time.extra_minute <lgl>,
# time.injury_time <lgl>, coaches.localteam_coach_id <int>, coaches.visitorteam_coach_id <int>, standings.localteam_position <int>,
# standings.visitorteam_position <int>, assistants.first_assistant_id <lgl>, assistants.second_assistant_id <lgl>, assistants.fourth_official_id <lgl>,
# leg <chr>, colors <lgl>, deleted <lgl>, is_placeholder <lgl>, localTeam.data.id <int>, localTeam.data.legacy_id <int>, localTeam.data.name <chr>,
# localTeam.data.short_code <chr>, localTeam.data.twitter <lgl>, localTeam.data.country_id <int>, localTeam.data.national_team <lgl>,
# localTeam.data.founded <int>, localTeam.data.logo_path <chr>, localTeam.data.venue_id <int>, localTeam.data.current_season_id <int>,
# localTeam.data.is_placeholder <lgl>, visitorTeam.data.id <int>, visitorTeam.data.legacy_id <int>, visitorTeam.data.name <chr>,
# visitorTeam.data.short_code <chr>, visitorTeam.data.twitter <lgl>, visitorTeam.data.country_id <int>, visitorTeam.data.national_team <lgl>,
# visitorTeam.data.founded <int>, visitorTeam.data.logo_path <chr>, visitorTeam.data.venue_id <int>, visitorTeam.data.current_season_id <int>,
# visitorTeam.data.is_placeholder <lgl>
Upvotes: 3