list of lists of matrices (from JSON) into single data.frame - purrr has problems with differing row numbers?

Question

I'm trying to use the information contained in keyed JSON names to add context to the data contained in their nested matrices. The matrices have different numbers of rows, and some of the matrices are missing (list element NULL). I am able to extract the relevant data and retain information as list names from the hierarchy using map and at_depth from the purrr package, but I cannot find a clean way to get this into a single data.frame.

I have attempted to use purrr:::transpose as exemplified here, and I've tried using tidyr:::unnest as shown here, but I think their desired results and inputs differ enough from mine that they are not applicable. There seems to be too many problems with the differing row names and/or the missing matrices. I am also new to the purrr package, so there could be something simple that I'm missing here.

Here is my own attempt which produces nearly the desired result, and I think I could modify it a bit more to remove the for loop and have another layer of some 'apply' functions, but I have the suspicion that there are better ways to go about this.

Minimal reproducible Example

#Download data
json <- getURL("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi?type=lake_survey&id=69070100")
#Surveys are the relevant data
data.listed <- fromJSON(json, simplifyDataFrame=F)
surveys <- data.listed$result$surveys

#Get list of lists of matrices - fish size count data
fcounts <- map(surveys, "lengths") %>% 
  at_depth(2, "fishCount") %>%
  at_depth(2, data.frame) # side note: is this a good way to the inner matrices to data.frames?
#top-level - list - surveys 
   #2nd-level - list - species in each survey
      #3rd-level - data.frame - X1: measured_size, X2: counts
#use survey IDs as names for top level list
#just as species are used as names for 2nd level lists
names(fcounts) <- sapply(surveys, function(s) {return(s$surveyID)})

#This produces nearly the correct result

for (i in 1:length(fcounts)){
  surv.id <- names(fcounts)[[i]]
  if (length(fcounts[[i]]) > 0) {
    listed.withSpecies <- lapply(names(fcounts[[i]]), function(species) cbind(fcounts[[i]][[species]], species))
    surv.fishCounts <- do.call(rbind, listed.withSpecies)
    colnames(surv.fishCounts) <- c("size", "count", "species")
    surv.fishCounts$survey.ID <- surv.id
    print(surv.fishCounts)
  }
}

hrbrmstr · Accepted Answer

This is one way to get nested data frames of the lengths counts into a big data frame:

library(httr)
library(tidyverse)

res <- GET("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi",
           query = list(type="lake_survey", id="69070100"))

content(res, as="text") %>%
  jsonlite::fromJSON(simplifyDataFrame = FALSE, flatten=FALSE) -> x

x$result$surveys %>%
  map_df(~{
    tmp_df <- flatten_df(.x[c("surveyDate", "surveyID", "surveyType", "surveySubType")])
    lens <- .x$lengths
    if (length(lens) > 0) {
      fish <- names(lens)
      data_frame(fish,
                 max_length = map_dbl(lens, "maximum_length"),
                 min_length = map_dbl(lens, "minimum_length"),
                 lens = map(lens, "fishCount") %>%
                   map(~set_names(as_data_frame(.), c("catch_len", "ct"))))  %>%
        mutate(surveyDate = tmp_df$surveyDate,
               surveyType = tmp_df$surveyType,
               surveySubType = tmp_df$surveySubType,
               surveyID = tmp_df$surveyID) -> tmp_df
    }
    tmp_df
  }) -> lengths_df

glimpse(lengths_df)
## Observations: 21
## Variables: 8
## $ surveyDate     "1988-07-19", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-...
## $ surveyID       "107278", "107539", "107539", "107539", "107539", "107539", "107539", "107539", "107539", "10...
## $ surveyType     "Standard Survey", "Standard Survey", "Standard Survey", "Standard Survey", "Standard Survey"...
## $ surveySubType  "Population Assessment", "Re-Survey", "Re-Survey", "Re-Survey", "Re-Survey", "Re-Survey", "Re...
## $ fish           NA, "PMK", "BLB", "LMB", "YEP", "BLG", "WTS", "WAE", "NOP", "GSF", "BLC", NA, "HSF", "PMK", "...
## $ max_length     NA, 6, 12, 16, 6, 7, 18, 18, 36, 4, 10, NA, 8, 7, 12, 12, 6, 8, 23, 38, 12
## $ min_length     NA, 3, 10, 1, 3, 3, 16, 16, 6, 4, 4, NA, 7, 4, 10, 12, 5, 3, 12, 9, 7
## $ lens           [NULL, , ,                                                          
## 1  1988-07-19   107278 Standard Survey Population Assessment           NA         NA            
## 2  1995-07-17   107539 Standard Survey             Re-Survey   PMK          6          3  
## 3  1995-07-17   107539 Standard Survey             Re-Survey   BLB         12         10  
## 4  1995-07-17   107539 Standard Survey             Re-Survey   LMB         16          1  
## 5  1995-07-17   107539 Standard Survey             Re-Survey   YEP          6          3  
## 6  1995-07-17   107539 Standard Survey             Re-Survey   BLG          7          3  
## 7  1995-07-17   107539 Standard Survey             Re-Survey   WTS         18         16  
## 8  1995-07-17   107539 Standard Survey             Re-Survey   WAE         18         16  
## 9  1995-07-17   107539 Standard Survey             Re-Survey   NOP         36          6 
## 10 1995-07-17   107539 Standard Survey             Re-Survey   GSF          4          4  
## 11 1995-07-17   107539 Standard Survey             Re-Survey   BLC         10          4  
## 12 1992-07-24   107587 Standard Survey             Re-Survey           NA         NA            
## 13 2005-07-11   107906 Standard Survey Population Assessment   HSF          8          7  
## 14 2005-07-11   107906 Standard Survey Population Assessment   PMK          7          4  
## 15 2005-07-11   107906 Standard Survey Population Assessment   BLB         12         10  
## 16 2005-07-11   107906 Standard Survey Population Assessment   LMB         12         12  
## 17 2005-07-11   107906 Standard Survey Population Assessment   YEP          6          5  
## 18 2005-07-11   107906 Standard Survey Population Assessment   BLG          8          3  
## 19 2005-07-11   107906 Standard Survey Population Assessment   WAE         23         12  
## 20 2005-07-11   107906 Standard Survey Population Assessment   NOP         38          9 
## 21 2005-07-11   107906 Standard Survey Population Assessment   BLC         12          7

You can expand the nested catch observations this way:

filter(lengths_df, !map_lgl(lens, is.null)) %>%
  unnest(lens)
## # A tibble: 98 × 9
##    surveyDate surveyID      surveyType surveySubType  fish max_length min_length catch_len    ct
##                                                    
## 1  1995-07-17   107539 Standard Survey     Re-Survey   PMK          6          3         3     1
## 2  1995-07-17   107539 Standard Survey     Re-Survey   PMK          6          3         6     3
## 3  1995-07-17   107539 Standard Survey     Re-Survey   BLB         12         10        10     1
## 4  1995-07-17   107539 Standard Survey     Re-Survey   BLB         12         10        11     1
## 5  1995-07-17   107539 Standard Survey     Re-Survey   BLB         12         10        12     4
## 6  1995-07-17   107539 Standard Survey     Re-Survey   LMB         16          1         1     1
## 7  1995-07-17   107539 Standard Survey     Re-Survey   LMB         16          1        16     1
## 8  1995-07-17   107539 Standard Survey     Re-Survey   LMB         16          1         2     6
## 9  1995-07-17   107539 Standard Survey     Re-Survey   LMB         16          1         4     4
## 10 1995-07-17   107539 Standard Survey     Re-Survey   LMB         16          1         5     2
## # ... with 88 more rows

list of lists of matrices (from JSON) into single data.frame - purrr has problems with differing row numbers?

Minimal reproducible Example

Answers (1)

Related Questions