AF7
AF7

Reputation: 3242

Extract multiple columns from list of lists, and save in data.frame

I have the following list:

library(rjson)
j <- fromJSON(file='https://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=1000&type=Dataset&replica=false&latest=true&project=CORDEX&domain=EUR-11&experiment=rcp85&time_frequency=day&facets=rcm_name%2Cproject%2Cproduct%2Cdomain%2Cinstitute%2Cdriving_model%2Cexperiment%2Cexperiment_family%2Censemble%2Crcm_version%2Ctime_frequency%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson')

I am interested in extracting data from this component: j$response$docs, which is a list of lists. The 'internal' lists are all supposed to have the same names.

I want to save the output to a data.frame() or tibble().

This below works and gives the desired output, for the few selected variables:

nmod <- length(j$response$docs)
for (i in 1:nmod) {
    #select one list at a time
    j1 <- j$response$docs[[i]]
    tmp <- data.frame(variable=j1$variable,
                        variable_long_name=j1$variable_long_name,
                        rcm_name=j1$rcm_name,
                        driving_model=j1$driving_model,
                        cf_standard_name=j1$cf_standard_name
                        )
    #join them
    if (i==1) {
        d <- tmp
    } else {
        d <- rbind(d, tmp)
    }
}

However, I'd like to know if there is a more elegant and efficient way, maybe using tidyr, dplyr or purrr, which also would allow me to select all ¨columns¨, instead of just the few selected there.

Upvotes: 0

Views: 1930

Answers (2)

aosmith
aosmith

Reputation: 36084

You can do it with help from package purrr. I thought at_depth might work here, but instead I ended up using nested map_df.

library(purrr)

Your variables are different lengths, so the first thing to do is to make sure each variable is length 1. This can be done by collapsing each element of the inner list with paste. I used commas a separator. Doing this via map_df returns a 1 row tibble.

Here's an example with the first inner list.

map_df(j$response$docs[[1]], paste, collapse = ",")

Now we can loop through the outer lists, making a 1 row tibble for each. We use map_df to bind each of these together. The output is a 832 row tibble, one row per list. I used the .id argument to add a grouping variable to the result, which may not be needed.

d1 = map_df(j$response$docs, ~map_df(.x, paste, collapse = ","))
d1

# A tibble: 832 × 45
   group                                                                                                   id  version
   <chr>                                                                                                <chr>    <chr>
1      1   cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.clh.v20131119|cordexesg.dmi.dk 20131119
2      2 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.clivi.v20131119|cordexesg.dmi.dk 20131119
3      3  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsds.v20131119|cordexesg.dmi.dk 20131119
4      4  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlds.v20131119|cordexesg.dmi.dk 20131119
5      5  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsus.v20131119|cordexesg.dmi.dk 20131119
6      6  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlus.v20131119|cordexesg.dmi.dk 20131119
7      7  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsdt.v20131119|cordexesg.dmi.dk 20131119
8      8  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsut.v20131119|cordexesg.dmi.dk 20131119
9      9  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlut.v20131119|cordexesg.dmi.dk 20131119
10    10   cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.psl.v20131119|cordexesg.dmi.dk 20131119
# ... with 822 more rows, and 42 more variables:

If you want to get multiple rows for the variables that were greater than length 1, such as access and experiment_family, you can use tidyr::separate_rows to separate the data onto multiple rows.

tidyr::separate_rows(d1, experiment_family)

Upvotes: 2

Carl Boneri
Carl Boneri

Reputation: 2722

instead of rjson go with this:

library(jsonlite)
j <- jsonlite::fromJSON('https://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=1000&type=Dataset&replica=false&latest=true&project=CORDEX&domain=EUR-11&experiment=rcp85&time_frequency=day&facets=rcm_name%2Cproject%2Cproduct%2Cdomain%2Cinstitute%2Cdriving_model%2Cexperiment%2Cexperiment_family%2Censemble%2Crcm_version%2Ctime_frequency%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson')

# The names you wan to find in the nested returned data
look_for <- c('variable','variable_long_name' ,
              'rcm_name','driving_model',
              'cf_standard_name')


new_df <- as.data.frame(sapply(look_for, function(i){
  unlist(j$response$docs[[i]])
}))

str(new_df)
'data.frame':   832 obs. of  5 variables:
$ variable          : chr  "clh" "clivi" "rsds" "rlds" ...
$ variable_long_name: chr  "High Level Cloud Fraction" "Ice Water Path" "Surface Downwelling Shortwave Radiation" "Surface Downwelling Longwave Radiation" ...
$ rcm_name          : chr  "HIRHAM5" "HIRHAM5" "HIRHAM5" "HIRHAM5" ...
$ driving_model     : chr  "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" ...
$ cf_standard_name  : chr  "cloud_area_fraction_in_atmosphere_layer" "atmosphere_cloud_ice_content" "surface_downwelling_shortwave_flux_in_air" "surface_downwelling_longwave_flux_in_air" ...

Upvotes: 0

Related Questions