mat4nier

Reputation: 262

Map over json in R and create dataframe combining multiple list levels [refactoring advice]

I am mapping over a series of entries at level x of a JSON document. Each entry at level x has nested entries (level x+1) containing some information that I want to combine into a data frame along with some information from level x.

This is a toy example I'm using to learn purrr and handling json in R.

E.g.

(entry) <- level x 
   (year: 2016)         <- want this 
   (category: "physics") <- want this
       (winners)  
            (1) <- level x+1
               (name: "bob" ) <- want this 
               (id: ) <- want this 
            (2..n) <- level x+1 
               (name: "steve" ) <- want this 
               (id: ) <- want this 

To make a dataframe:

 name id year category 
 bob   1  2016 physics 
 steve 2  2016 physics
 mel   3  2016 chemistry .. etc

I have this solved but it's using a nested map on every level of x and is very brittle:

 library(purrr)
 library(tidyverse)
 library(stringr)
 library(jsonlite)
 # get example data 
 winners <- fromJSON("http://api.nobelprize.org/v1/prize.json", simplifyDataFrame=FALSE) 


 x <- winners$prizes %>%
   map_df(function(prize) {
     map_df(prize$laureates, function(person) {
       tibble(id = person$id,
              firstname = person$firstname,
              surname = ifelse(!is.null(person$surname), person$surname, NA),
              category = prize$category,
              year = prize$year)
     })
   })

Is there a better way to be doing this? Concerns w/ above code:

Upvotes: 0

Views: 953

Answers (1)

hrbrmstr

Reputation: 78792

What you did was — as they say in New England — perfectly fine, esp since it resulted in a working solution that was readable by other folks (i.e. the two most important things).

This is the approach I'd take (it's only slightly different):

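library(tidyverse)  # purrr (map_df, flatten_df) and dplyr (mutate, select)
library(jsonlite)   # fromJSON

winners <- fromJSON("http://api.nobelprize.org/v1/prize.json", simplifyDataFrame=FALSE) 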

extract_laureates <- function(x) {

  surname <- NULL

  map_df(x$laureates, flatten_df) %>% 
    mutate(name=paste(firstname, surname, sep=" "),
           year=x$year, 
           category=x$category) %>% 
    select(name, id, year, category)

}

map_df(winners$prizes, extract_laureates)
## # A tibble: 911 × 4
##                      name    id  year   category
##                     <chr> <chr> <chr>      <chr>
## 1       David J. Thouless   928  2016    physics
## 2    F. Duncan M. Haldane   929  2016    physics
## 3   J. Michael Kosterlitz   930  2016    physics
## 4     Jean-Pierre Sauvage   931  2016  chemistry
## 5  Sir J. Fraser Stoddart   932  2016  chemistry
## 6      Bernard L. Feringa   933  2016  chemistry
## 7        Yoshinori Ohsumi   927  2016   medicine
## 8               Bob Dylan   937  2016 literature
## 9      Juan Manuel Santos   934  2016      peace
## 10            Oliver Hart   935  2016  economics
## # ... with 901 more rows

Unless I'm writing a quick hack that I'm pretty sure I'll never use again, I like to make non-anonymous functions since it helps when breaking down the logic/steps.

You can use the scoping rules of R to simplify the ifelse() by declaring a variable with the same name as the column. If dplyr finds a column with that name it'll use it. If not, R will use the local variable.
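
The column-versus-local lookup can be seen in isolation with a minimal sketch. Everything here is illustrative: `full_name()`, `with_col` and `without_col` are hypothetical names, the data is made up, and the local fallback is assumed to be an empty string (rather than `NULL`) so the result is easy to predict.

```r
library(dplyr)

full_name <- function(df) {
  # Local fallback, only reached when `df` has no `surname` column.
  # Assumption: "" is used here (not NULL) to keep the output predictable.
  surname <- ""
  mutate(df, name = trimws(paste(firstname, surname, sep = " ")))
}

# Made-up rows: one with a surname column, one without.
with_col    <- tibble(firstname = "Bob", surname = "Dylan")
without_col <- tibble(firstname = "Amnesty International")

full_name(with_col)$name     # the column is found first: "Bob Dylan"
full_name(without_col)$name  # falls back to the local: "Amnesty International"
```

The `trimws()` call only strips the trailing separator left behind when the fallback is used.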

Then, we add the year and category to the new data_frame and select() out what you wanted.

To address your specific questions:

  1. I'm not sure what you mean by "brittle". You handle the edge case(s) and most JSON requires specialized methods for proper extraction.
  2. I don't see a way to do this w/o two map…() calls. Even if I could come up with one, it'd probably look like an ugly, unreadable hack (remember, you're writing code for humans).
  3. The missing-keys question was covered in the explanation above.

Another option is to wait until the "filled" data.frame is built then do the name processing:

extract_laureates <- function(x) {
  map_df(x$laureates, flatten_df) %>% 
    mutate(year=x$year, category=x$category)
}

map_df(winners$prizes, extract_laureates) %>% 
  mutate(surname=ifelse(is.na(surname), "", surname),   # blank out missing surnames
         name=trimws(paste(firstname, surname, sep=" "))) %>% 
  select(name, id, year, category) %>% View()
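
To see what the "filled" data frame looks like without calling the API, a small hand-built list of the same shape can stand in for `winners$prizes`. This is only a sketch: `toy_prizes` and `toy` are hypothetical names, and the names and ids in it are invented. With purrr 1.0 or later, `flatten_df()` may emit a deprecation warning.

```r
library(purrr)
library(dplyr)

# Hand-built stand-in for winners$prizes; all values are invented.
toy_prizes <- list(
  list(year = "2016", category = "physics",
       laureates = list(
         list(id = "1", firstname = "Bob", surname = "Smith"),
         list(id = "2", firstname = "Steve", surname = "Jones"))),
  list(year = "2016", category = "peace",
       laureates = list(
         list(id = "3", firstname = "Some Organization")))  # no surname key
)

extract_laureates <- function(x) {
  # One row per laureate, then attach the prize-level fields.
  map_df(x$laureates, flatten_df) %>%
    mutate(year = x$year, category = x$category)
}

# Row-binding across prizes fills the missing surname with NA.
toy <- map_df(toy_prizes, extract_laureates)
toy
```

The third row has no `surname` key in the input, so binding the per-prize results leaves it as `NA`, which is the value the name processing then has to handle.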

Upvotes: 1
