Reputation: 243
This is a direct follow up to a previous and similar question I asked on extracting a specific subset of a list of lists: Extracting data from a list of lists into its own `data.frame` with `purrr`
Hence I will use the same sample dataset:
l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T",
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", "label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e")))
The general issue I am trying to resolve is to extract contents of a nested list which are of varying lengths, and bind them to other contents within the same list which are essentially being used as IDs for the nested contents.
In the context of the above sample dataset, I am trying to extract the contents of the sublist d
into a data.table
/data.frame
, but also extract and essentially repeat the data in a
for each element -- so that I can understand which extracted elements in d
belong in the same subset, due to their differing lengths. An example of the desired data.table
will explain best:
a id label link score externalId
-1.5467647 5 Utah Asia/Anadyr -0.2110459 NA
-0.9348211 8 South Carolina Pacific/Wallis 0.5265409 -6.743544
-0.9348211 9 Nebraska America/Scoresbysund 0.2508955 16.42575
Note that the first column a
is the contents of the first sublist within l
. The first row is the content from the first nested item in d
(length 1), then the second and third row is the content from the second item in d
(length 2) hence the value in a
is the same -0.9348211
.
At present my solutions of accomplishing this are in a round-about fashion, and prone to error -- and given the relation to the referenced post above, I wonder if I am not understanding the solution to be able to extend it to this related problem.
Upvotes: 0
Views: 355
Reputation: 43354
Each nested list tends to require a slightly different approach, but this covers some typical ones:
library(tidyverse)
l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T",
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", "label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e")))
l %>%
map(set_names, letters[1:5]) %>% # add missing names
map(modify_at, 'd', bind_rows) %>% # coerce nested elements to data.frame
# make each element to a data.frame, and rbind them all together
map_df(data.frame, stringsAsFactors = FALSE)
#> a b c d.id d.label d.link d.score e d.externalId
#> 1 -1.5467647 s T 5 Utah Asia/Anadyr -0.2110459 49.12799 NA
#> 2 -0.9348211 k T 8 South Carolina Pacific/Wallis 0.5265409 52.31614 -6.743544
#> 3 -0.9348211 k T 9 Nebraska America/Scoresbysund 0.2508955 52.31614 16.425747
#> 4 -0.2726149 f P 8 Georgia America/Nome 0.5264941 132.11535 7.915836
#> 5 -0.2726149 f P 2 Washington America/Shiprock -0.5551864 132.11535 15.068666
#> 6 -0.2726149 f P 6 North Dakota Universal 1.0316830 132.11535 NA
#> 7 -0.2726149 f P 1 New Hampshire America/Cordoba 1.2158206 132.11535 9.727642
#> 8 -0.2726149 f P 1 Alaska Asia/Istanbul -0.2318326 132.11535 NA
#> 9 -0.2726149 f P 4 Pennsylvania Africa/Dar_es_Salaam 0.5902453 132.11535 NA
#> 10 0.2026860 x O 3 Delaware Asia/Samarkand 0.6955771 97.99089 15.236482
#> 11 -0.3962434 z P 4 North Dakota America/Tortola 1.0306027 123.59795 -7.216669
#> 12 -0.3962434 z P 9 Nebraska America/Ojinaga -1.1139800 123.59795 -8.451451
There are many more ways to do this, but the key is to start by arranging the most nested elements into the proper data structure, and then combining them with the remaining elements until you have a data.frame.
Note that using data.frame
instead of a tibble equivalent is a little hacky here, but data.frame is much better at slurping up both data.frames and values into a single data.frame, recycling as necessary. Using a tidyverse version would require making everything the correct length instead of relying on recycling.
Upvotes: 4