daRknight
daRknight

Reputation: 243

Extracting data from hierarchical lists of different lengths into `data.frame` using `purrr`

This is a direct follow up to a previous and similar question I asked on extracting a specific subset of a list of lists: Extracting data from a list of lists into its own `data.frame` with `purrr`

Hence I will use the same sample dataset:

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
                     d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
                                        score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", "label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e")))

The general issue I am trying to resolve is to extract contents of a nested list which are of varying lengths, and bind them to other contents within the same list which are essentially being used as IDs for the nested contents.

In the context of the above sample dataset, I am trying to extract the contents of the sublist d into a data.table/data.frame, but also extract and essentially repeat the data in a for each element -- so that I can understand which extracted elements in d belong in the same subset, due to their differing lengths. An example of the desired data.table will explain best:

a          id           label                        link       score  externalId
-1.5467647  5            Utah                 Asia/Anadyr  -0.2110459          NA
-0.9348211  8  South Carolina              Pacific/Wallis   0.5265409   -6.743544
-0.9348211  9        Nebraska        America/Scoresbysund   0.2508955    16.42575

Note that the first column a is the contents of the first sublist within l. The first row is the content from the first nested item in d (length 1), then the second and third row is the content from the second item in d (length 2) hence the value in a is the same -0.9348211.

At present my solutions of accomplishing this are in a round-about fashion, and prone to error -- and given the relation to the referenced post above, I wonder if I am not understanding the solution to be able to extend it to this related problem.

Upvotes: 0

Views: 355

Answers (1)

alistaire
alistaire

Reputation: 43354

Each nested list tends to require a slightly different approach, but this covers some typical ones:

library(tidyverse)

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
                     d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
                                        score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", "label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e")))

l %>% 
    map(set_names, letters[1:5]) %>%    # add missing names
    map(modify_at, 'd', bind_rows) %>%    # coerce nested elements to data.frame
    # make each element to a data.frame, and rbind them all together
    map_df(data.frame, stringsAsFactors = FALSE)
#>             a b c d.id        d.label               d.link    d.score         e d.externalId
#> 1  -1.5467647 s T    5           Utah          Asia/Anadyr -0.2110459  49.12799           NA
#> 2  -0.9348211 k T    8 South Carolina       Pacific/Wallis  0.5265409  52.31614    -6.743544
#> 3  -0.9348211 k T    9       Nebraska America/Scoresbysund  0.2508955  52.31614    16.425747
#> 4  -0.2726149 f P    8        Georgia         America/Nome  0.5264941 132.11535     7.915836
#> 5  -0.2726149 f P    2     Washington     America/Shiprock -0.5551864 132.11535    15.068666
#> 6  -0.2726149 f P    6   North Dakota            Universal  1.0316830 132.11535           NA
#> 7  -0.2726149 f P    1  New Hampshire      America/Cordoba  1.2158206 132.11535     9.727642
#> 8  -0.2726149 f P    1         Alaska        Asia/Istanbul -0.2318326 132.11535           NA
#> 9  -0.2726149 f P    4   Pennsylvania Africa/Dar_es_Salaam  0.5902453 132.11535           NA
#> 10  0.2026860 x O    3       Delaware       Asia/Samarkand  0.6955771  97.99089    15.236482
#> 11 -0.3962434 z P    4   North Dakota      America/Tortola  1.0306027 123.59795    -7.216669
#> 12 -0.3962434 z P    9       Nebraska      America/Ojinaga -1.1139800 123.59795    -8.451451

There are many more ways to do this, but the key is to start by arranging the most nested elements into the proper data structure, and then combining them with the remaining elements until you have a data.frame.

Note that using data.frame instead of a tibble equivalent is a little hacky here, but data.frame is much better at slurping up both data.frames and values into a single data.frame, recycling as necessary. Using a tidyverse version would require making everything the correct length instead of relying on recycling.

Upvotes: 4

Related Questions