How can I extract elements from a list of lists based on element type instead of name or position?

I have a list of lists (of lists of lists...it's lists all the way down) called geos with geolocation information for U.S. cities returned by the Google Maps API using the geocode() function in ggmap (see dput at the bottom of this question for a representative sample of data on 10 cities).

I would now like to use bits of this list to populate a data frame with one row per location, i.e., per element of the vector of locations used in the API query. For argument's sake, let's say I wanted the resulting data frame to include columns for locality, administrative_area_level_2 (county), and administrative_area_level_1 (state), using long names for the first two and the short name for the last. Here's how the desired result would look.

            locality administrative_area_level_2 administrative_area_level_1
1          Franconia              Grafton County                          NH
2             Wausau             Marathon County                          WI
3         Northfield             Franklin County                          MA
4         South Bend           St. Joseph County                          IN
5          Lanesboro             Fillmore County                          MN
6          Cheboygan            Cheboygan County                          MI
7         Chelmsford            Middlesex County                          MA
8  Saint Clairsville              Belmont County                          OH
9      New Hyde Park               Nassau County                          NY
10         Jefferson                 Ashe County                          NC

All of the elements I want are in the address_components sub-list, which I can isolate as follows.

library(dplyr)
library(purrr)

address_components <- geos %>%
  map("results") %>%
  map(1) %>%
  map("address_components")

The tricky bit is that the resulting lists (now items 1 thru 10 in that new list called address_components) have varying lengths; the elements of those lists aren't named; and the position of the bits I want changes with list length. Instead of names for the list elements, we have (of course) a list within each list element called types that describes what that element is. So, for example, county might be the 2nd or 3rd or 4th element of address_components, and wherever it is, we can recognize it because the types sublist at that position includes the string "administrative_area_level_2" as one of its elements.
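To make the shifting positions concrete, here is a minimal sketch that looks up where the county component sits for two locations. The object toy_components is a hypothetical, trimmed stand-in for address_components (values copied from the sample data), and detect_index() from purrr is just one way to do the lookup.

```r
library(purrr)

# Toy stand-in for two entries of address_components (trimmed from the sample data)
toy_components <- list(
  list(
    list(long_name = "Franconia", short_name = "Franconia",
         types = list("locality", "political")),
    list(long_name = "Grafton County", short_name = "Grafton County",
         types = list("administrative_area_level_2", "political"))
  ),
  list(
    list(long_name = "South Bend", short_name = "South Bend",
         types = list("locality", "political")),
    list(long_name = "Portage Township", short_name = "Portage Township",
         types = list("administrative_area_level_3", "political")),
    list(long_name = "St. Joseph County", short_name = "St Joseph County",
         types = list("administrative_area_level_2", "political"))
  )
)

# For each location, find the position of the component tagged as the county
county_position <- map_int(
  toy_components,
  ~ detect_index(.x, function(comp) {
    "administrative_area_level_2" %in% unlist(comp$types)
  })
)

county_position
```

Here the county is the 2nd component for Franconia but the 3rd for South Bend, which is why indexing by position alone does not work.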

Is there a way programmatically to extract certain elements from that list based on these attributes of other elements at their level? In pseudocode, to get the county name, for example, I'd write something like...

if ("administrative_area_level_2" %in% unlist(types)) return long_name

So how can I actually do this in R? Is there some SQL-driven solution to this problem? Or can it be done in the tidyverse with some clever application of purrr functionality?

As promised, here is a sample of the list I'm working with.

geos <- list(list(results = list(list(address_components = list(list(
    long_name = "Franconia", short_name = "Franconia", types = list(
        "locality", "political")), list(long_name = "Grafton County", 
    short_name = "Grafton County", types = list("administrative_area_level_2", 
        "political")), list(long_name = "New Hampshire", short_name = "NH", 
    types = list("administrative_area_level_1", "political")), 
    list(long_name = "United States", short_name = "US", types = list(
        "country", "political"))), formatted_address = "Franconia, NH, USA", 
    geometry = list(bounds = list(northeast = list(lat = 44.2531679, 
        lng = -71.537367), southwest = list(lat = 44.112035, 
        lng = -71.786752)), location = list(lat = 44.2271729, 
        lng = -71.7479075), location_type = "APPROXIMATE", viewport = list(
        northeast = list(lat = 44.2531679, lng = -71.537367), 
        southwest = list(lat = 44.112035, lng = -71.786752))), 
    place_id = "ChIJo86bzAl8tEwRtSTsEBwg1Gc", types = list("locality", 
        "political"))), status = "OK"), list(results = list(list(
    address_components = list(list(long_name = "Wausau", short_name = "Wausau", 
        types = list("locality", "political")), list(long_name = "Marathon County", 
        short_name = "Marathon County", types = list("administrative_area_level_2", 
            "political")), list(long_name = "Wisconsin", short_name = "WI", 
        types = list("administrative_area_level_1", "political")), 
        list(long_name = "United States", short_name = "US", 
            types = list("country", "political"))), formatted_address = "Wausau, WI, USA", 
    geometry = list(bounds = list(northeast = list(lat = 45.006429, 
        lng = -89.573319), southwest = list(lat = 44.918368, 
        lng = -89.7482299)), location = list(lat = 44.9591352, 
        lng = -89.6301221), location_type = "APPROXIMATE", viewport = list(
        northeast = list(lat = 45.006429, lng = -89.573319), 
        southwest = list(lat = 44.918368, lng = -89.7482299))), 
    place_id = "ChIJg0go-J0nAIgRXIvo6NhaKQM", types = list("locality", 
        "political"))), status = "OK"), list(results = list(list(
    address_components = list(list(long_name = "Northfield", 
        short_name = "Northfield", types = list("locality", "political")), 
        list(long_name = "Franklin County", short_name = "Franklin County", 
            types = list("administrative_area_level_2", "political")), 
        list(long_name = "Massachusetts", short_name = "MA", 
            types = list("administrative_area_level_1", "political")), 
        list(long_name = "United States", short_name = "US", 
            types = list("country", "political"))), formatted_address = "Northfield, MA, USA", 
    geometry = list(bounds = list(northeast = list(lat = 42.7285309, 
        lng = -72.377039), southwest = list(lat = 42.604405, 
        lng = -72.5167739)), location = list(lat = 42.6959093, 
        lng = -72.4528885), location_type = "APPROXIMATE", viewport = list(
        northeast = list(lat = 42.7285309, lng = -72.377039), 
        southwest = list(lat = 42.604405, lng = -72.5167739))), 
    place_id = "ChIJ736z8Aw84YkRj0BUEm0QZgE", types = list("locality", 
        "political"))), status = "OK"), list(results = list(list(
    address_components = list(list(long_name = "South Bend", 
        short_name = "South Bend", types = list("locality", "political")), 
        list(long_name = "Portage Township", short_name = "Portage Township", 
            types = list("administrative_area_level_3", "political")), 
        list(long_name = "St. Joseph County", short_name = "St Joseph County", 
            types = list("administrative_area_level_2", "political")), 
        list(long_name = "Indiana", short_name = "IN", types = list(
            "administrative_area_level_1", "political")), list(
            long_name = "United States", short_name = "US", types = list(
                "country", "political"))), formatted_address = "South Bend, IN, USA", 
    geometry = list(bounds = list(northeast = list(lat = 41.752098, 
        lng = -86.1912859), southwest = list(lat = 41.5973428, 
        lng = -86.3604831)), location = list(lat = 41.6763545, 
        lng = -86.2519898), location_type = "APPROXIMATE", viewport = list(
        northeast = list(lat = 41.752098, lng = -86.1912859), 
        southwest = list(lat = 41.5973428, lng = -86.3604831))), 
    place_id = "ChIJE9NhSsQyEYgRBDKjb7PZSpc", types = list("locality", 
        "political"))), status = "OK"), list(results = list(list(
    address_components = list(list(long_name = "Lanesboro", short_name = "Lanesboro", 
        types = list("locality", "political")), list(long_name = "Holt Township", 
        short_name = "Holt Township", types = list("administrative_area_level_3", 
            "political")), list(long_name = "Fillmore County", 
        short_name = "Fillmore County", types = list("administrative_area_level_2", 
            "political")), list(long_name = "Minnesota", short_name = "MN", 
        types = list("administrative_area_level_1", "political")), 
        list(long_name = "United States", short_name = "US", 
            types = list("country", "political")), list(long_name = "55949", 
            short_name = "55949", types = list("postal_code"))), 
    formatted_address = "Lanesboro, MN 55949, USA", geometry = list(
        bounds = list(northeast = list(lat = 43.7312198, lng = -91.9545843), 
            southwest = list(lat = 43.7060355, lng = -91.9844293)), 
        location = list(lat = 43.7187813, lng = -91.9759204), 
        location_type = "APPROXIMATE", viewport = list(northeast = list(
            lat = 43.7312198, lng = -91.9545843), southwest = list(
            lat = 43.7060355, lng = -91.9844293))), place_id = "ChIJr2SDMZco-ocRb_dB0eZDTLU", 
    types = list("locality", "political"))), status = "OK"), 
    list(results = list(list(address_components = list(list(long_name = "Cheboygan", 
        short_name = "Cheboygan", types = list("locality", "political")), 
        list(long_name = "Cheboygan County", short_name = "Cheboygan County", 
            types = list("administrative_area_level_2", "political")), 
        list(long_name = "Michigan", short_name = "MI", types = list(
            "administrative_area_level_1", "political")), list(
            long_name = "United States", short_name = "US", types = list(
                "country", "political")), list(long_name = "49721", 
            short_name = "49721", types = list("postal_code"))), 
        formatted_address = "Cheboygan, MI 49721, USA", geometry = list(
            bounds = list(northeast = list(lat = 45.669849, lng = -84.4330271), 
                southwest = list(lat = 45.6198179, lng = -84.4984899)), 
            location = list(lat = 45.6469563, lng = -84.4744795), 
            location_type = "APPROXIMATE", viewport = list(northeast = list(
                lat = 45.669849, lng = -84.4330271), southwest = list(
                lat = 45.6198179, lng = -84.4984899))), place_id = "ChIJywA0rYKiNU0R6yCfyEI79dI", 
        types = list("locality", "political"))), status = "OK"), 
    list(results = list(list(address_components = list(list(long_name = "Chelmsford", 
        short_name = "Chelmsford", types = list("locality", "political")), 
        list(long_name = "Middlesex County", short_name = "Middlesex County", 
            types = list("administrative_area_level_2", "political")), 
        list(long_name = "Massachusetts", short_name = "MA", 
            types = list("administrative_area_level_1", "political")), 
        list(long_name = "United States", short_name = "US", 
            types = list("country", "political"))), formatted_address = "Chelmsford, MA, USA", 
        geometry = list(bounds = list(northeast = list(lat = 42.653754, 
            lng = -71.2942208), southwest = list(lat = 42.5496288, 
            lng = -71.4178121)), location = list(lat = 42.5998139, 
            lng = -71.3672838), location_type = "APPROXIMATE", 
            viewport = list(northeast = list(lat = 42.653754, 
                lng = -71.2942208), southwest = list(lat = 42.5496288, 
                lng = -71.4178121))), place_id = "ChIJx0tLqRej44kRi__M1sjNzjc", 
        types = list("locality", "political"))), status = "OK"), 
    list(results = list(list(address_components = list(list(long_name = "Saint Clairsville", 
        short_name = "St Clairsville", types = list("locality", 
            "political")), list(long_name = "Richland Township", 
        short_name = "Richland Township", types = list("administrative_area_level_3", 
            "political")), list(long_name = "Belmont County", 
        short_name = "Belmont County", types = list("administrative_area_level_2", 
            "political")), list(long_name = "Ohio", short_name = "OH", 
        types = list("administrative_area_level_1", "political")), 
        list(long_name = "United States", short_name = "US", 
            types = list("country", "political")), list(long_name = "43950", 
            short_name = "43950", types = list("postal_code"))), 
        formatted_address = "St Clairsville, OH 43950, USA", 
        geometry = list(bounds = list(northeast = list(lat = 40.097176, 
            lng = -80.8753491), southwest = list(lat = 40.0569829, 
            lng = -80.9266679)), location = list(lat = 40.0803199, 
            lng = -80.90176), location_type = "APPROXIMATE", 
            viewport = list(northeast = list(lat = 40.097176, 
                lng = -80.8753491), southwest = list(lat = 40.0569829, 
                lng = -80.9266679))), place_id = "ChIJD9-5fMFwNogRmDV43jTEVS0", 
        types = list("locality", "political"))), status = "OK"), 
    list(results = list(list(address_components = list(list(long_name = "New Hyde Park", 
        short_name = "New Hyde Park", types = list("locality", 
            "political")), list(long_name = "North Hempstead", 
        short_name = "North Hempstead", types = list("administrative_area_level_3", 
            "political")), list(long_name = "Nassau County", 
        short_name = "Nassau County", types = list("administrative_area_level_2", 
            "political")), list(long_name = "New York", short_name = "NY", 
        types = list("administrative_area_level_1", "political")), 
        list(long_name = "United States", short_name = "US", 
            types = list("country", "political"))), formatted_address = "New Hyde Park, NY, USA", 
        geometry = list(bounds = list(northeast = list(lat = 40.7419718, 
            lng = -73.6748929), southwest = list(lat = 40.7233181, 
            lng = -73.69721)), location = list(lat = 40.7351018, 
            lng = -73.6879082), location_type = "APPROXIMATE", 
            viewport = list(northeast = list(lat = 40.7419718, 
                lng = -73.6748929), southwest = list(lat = 40.7233181, 
                lng = -73.69721))), place_id = "ChIJOfwQ1pJiwokRQIZrHiBxJbA", 
        types = list("locality", "political"))), status = "OK"), 
    list(results = list(list(address_components = list(list(long_name = "Jefferson", 
        short_name = "Jefferson", types = list("locality", "political")), 
        list(long_name = "Jefferson", short_name = "Jefferson", 
            types = list("administrative_area_level_3", "political")), 
        list(long_name = "Ashe County", short_name = "Ashe County", 
            types = list("administrative_area_level_2", "political")), 
        list(long_name = "North Carolina", short_name = "NC", 
            types = list("administrative_area_level_1", "political")), 
        list(long_name = "United States", short_name = "US", 
            types = list("country", "political")), list(long_name = "28640", 
            short_name = "28640", types = list("postal_code"))), 
        formatted_address = "Jefferson, NC 28640, USA", geometry = list(
            bounds = list(northeast = list(lat = 36.430581, lng = -81.422682), 
                southwest = list(lat = 36.404752, lng = -81.4894969)), 
            location = list(lat = 36.420403, lng = -81.4734376), 
            location_type = "APPROXIMATE", viewport = list(northeast = list(
                lat = 36.430581, lng = -81.422682), southwest = list(
                lat = 36.404752, lng = -81.4894969))), place_id = "ChIJJfTHvEasUYgRsEKY3vcTFgc", 
        types = list("locality", "political"))), status = "OK"))

Upvotes: 2

Answers (3)

ulfelder

Reputation: 5335

After a lot of trial and error, I ended up figuring out how to do this with some help from the pluck() and keep() functions from purrr in particular. I wrote a function that lets me set the attribute I'm after, then used map_dfc() to iterate that function over the three attributes in my desired output: locality name, county name, and state name.

library(tidyverse)

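geo_extractor <- function(api_output, attribute, version = 'long_name') {

  api_output %>%
    # drill down to the address components of the first result for each location
    map(., ~purrr::pluck(., 'results', 1, 'address_components')) %>%
    # keep only the components whose contents match the attribute pattern
    map(., ~keep(., grepl(attribute, .))) %>%
    # take the requested name (long or short) from the first match
    map_chr(., ~purrr::pluck(., 1, version))

}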

desiderata <- c("locality", "level_2", "level_1")

dat <- setNames(map_dfc(desiderata, ~geo_extractor(geos, .)), desiderata)

Here's how the result looks.

> dat
# A tibble: 10 x 3
   locality          level_2           level_1       
   <chr>             <chr>             <chr>         
 1 Franconia         Grafton County    New Hampshire 
 2 Wausau            Marathon County   Wisconsin     
 3 Northfield        Franklin County   Massachusetts 
 4 South Bend        St. Joseph County Indiana       
 5 Lanesboro         Fillmore County   Minnesota     
 6 Cheboygan         Cheboygan County  Michigan      
 7 Chelmsford        Middlesex County  Massachusetts 
 8 Saint Clairsville Belmont County    Ohio          
 9 New Hyde Park     Nassau County     New York      
10 Jefferson         Ashe County       North Carolina

I know from solving a related version of this problem a slightly different way that this function will probably fail if the API output (here, geos) includes results for locations that couldn't be resolved or that don't include one or more of the attributes you're seeking (e.g., no county). I also know that you can work around that problem with some properly placed if/else constructs. That's not an issue in this toy example, however, so I'll declare victory for this question and move on.
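For completeness, here is one possible sketch of a more defensive version. It is not the function above: geo_extractor_safe and toy_geos are hypothetical names, it matches the type string exactly with %in% rather than with grepl(), and it relies on the .default argument of pluck() instead of if/else to return NA when a location has no results or lacks the requested component.

```r
library(purrr)

# Hypothetical helper: returns NA instead of failing when a location has
# no results or no component of the requested type.
geo_extractor_safe <- function(api_output, type, version = "long_name") {
  map_chr(api_output, function(x) {
    # empty list if the location could not be resolved
    components <- pluck(x, "results", 1, "address_components", .default = list())
    # first component whose types include the requested type, or NULL
    hit <- detect(components, function(comp) type %in% unlist(comp$types))
    pluck(hit, version, .default = NA_character_)
  })
}

# Toy input: one resolved location without a county, one unresolved location
toy_geos <- list(
  list(results = list(list(address_components = list(
    list(long_name = "Franconia", short_name = "Franconia",
         types = list("locality", "political")),
    list(long_name = "New Hampshire", short_name = "NH",
         types = list("administrative_area_level_1", "political"))
  ))), status = "OK"),
  list(results = list(), status = "ZERO_RESULTS")
)

locality <- geo_extractor_safe(toy_geos, "locality")
county <- geo_extractor_safe(toy_geos, "administrative_area_level_2")
state <- geo_extractor_safe(toy_geos, "administrative_area_level_1", version = "short_name")
```

With this toy input, locality is "Franconia" for the first location and NA for the second, county is NA for both, and state is "NH" followed by NA.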

Upvotes: 1

Onyambu

Reputation: 79228

You could do the following. Note that the result has many more columns than the three you asked for:

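library(tidyverse)

# pivot_wider() arguments are named explicitly so the call does not depend on argument order
stack(unlist(setNames(address_components,1:10)))%>%
   separate(ind,c("grp","nm"),"[.]")%>%
   group_by(grp,id = cumsum(str_detect(nm,"long_name")))%>%
   pivot_wider(id_cols = c(id,grp),names_from = nm,values_from = values)%>%
   pivot_wider(id_cols = grp,names_from = c(types1,types2,types),values_from = long_name)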
# A tibble: 10 x 7
# Groups:   grp [10]
   grp   locality_politic~ administrative_a~ administrative_~ country_politic~ administrative_~ NA_NA_postal_co~
   <chr> <chr>             <chr>             <chr>            <chr>            <chr>            <chr>           
 1 1     Franconia         Grafton County    New Hampshire    United States    NA               NA              
 2 2     Wausau            Marathon County   Wisconsin        United States    NA               NA              
 3 3     Northfield        Franklin County   Massachusetts    United States    NA               NA              
 4 4     South Bend        St. Joseph County Indiana          United States    Portage Township NA              
 5 5     Lanesboro         Fillmore County   Minnesota        United States    Holt Township    55949           
 6 6     Cheboygan         Cheboygan County  Michigan         United States    NA               49721           
 7 7     Chelmsford        Middlesex County  Massachusetts    United States    NA               NA              
 8 8     Saint Clairsville Belmont County    Ohio             United States    Richland Townsh~ 43950           
 9 9     New Hyde Park     Nassau County     New York         United States    North Hempstead  NA              
10 10    Jefferson         Ashe County       North Carolina   United States    Jefferson        28640

or if you want the short names:

stack(unlist(setNames(address_components,1:10)))%>%
   separate(ind,c("grp","nm"),"[.]")%>%
   group_by(grp,id = cumsum(str_detect(nm,"long_name")))%>%
   pivot_wider(id_cols = c(id,grp),names_from = nm,values_from = values)%>%
   pivot_wider(id_cols = grp,names_from = c(types1,types2,types),values_from = short_name)
# A tibble: 10 x 7
# Groups:   grp [10]
   grp   locality_politic~ administrative_a~ administrative_~ country_politic~ administrative_~ NA_NA_postal_co~
   <chr> <chr>             <chr>             <chr>            <chr>            <chr>            <chr>           
 1 1     Franconia         Grafton County    NH               US               NA               NA              
 2 2     Wausau            Marathon County   WI               US               NA               NA              
 3 3     Northfield        Franklin County   MA               US               NA               NA              
 4 4     South Bend        St Joseph County  IN               US               Portage Township NA              
 5 5     Lanesboro         Fillmore County   MN               US               Holt Township    55949           
 6 6     Cheboygan         Cheboygan County  MI               US               NA               49721           
 7 7     Chelmsford        Middlesex County  MA               US               NA               NA              
 8 8     St Clairsville    Belmont County    OH               US               Richland Townsh~ 43950           
 9 9     New Hyde Park     Nassau County     NY               US               North Hempstead  NA              
10 10    Jefferson         Ashe County       NC               US               Jefferson        28640

Upvotes: 1

user10917479

Reputation:

I don't think this gets you all the way there, but it seems like there are several things you would want to do with the data.

Does unnesting it into a long format like this do what you would like? From here it should just be a matter of filters and pivots using standard dplyr and tidyr functions.

Each record from the original nested list is identified by grouping on record and record2.

library(dplyr)
library(purrr)
library(tidyr)   # pivot_longer(), unnest()
library(tibble)

address_long <- address_components %>%
  map_dfr(~ set_names(.x, seq.int(length(.x))), .id = "record") %>% 
  pivot_longer(-record, names_to = "record2") %>% 
  mutate(name = names(value)) %>%
  mutate(value = simplify_all(value)) %>% 
  unnest(value) %>% 
  rowid_to_column()
  
col_types <- address_long %>% 
  filter(name == "types",
         value != "political") %>% 
  select(record, record2, type = value)

address_long %>% 
  filter(name != "types") %>% 
  left_join(col_types, by = c("record", "record2"))

# # A tibble: 98 x 6
# rowid record record2 value           name       type                       
# <int> <chr>  <chr>   <chr>           <chr>      <chr>                      
# 1     1 1      1       Franconia       long_name  locality                   
# 2     2 1      2       Grafton County  long_name  administrative_area_level_2
# 3     3 1      3       New Hampshire   long_name  administrative_area_level_1
# 4     4 1      4       United States   long_name  country                    
# 5     5 1      1       Franconia       short_name locality                   
# 6     6 1      2       Grafton County  short_name administrative_area_level_2
# 7     7 1      3       NH              short_name administrative_area_level_1
# 8     8 1      4       US              short_name country                    
# 9    17 2      1       Wausau          long_name  locality                   
# 10   18 2      2       Marathon County long_name  administrative_area_level_2
# # ... with 88 more rows
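
As a rough sketch of the "filters and pivots" step, assuming a joined table shaped like the output above: toy_long below is a hypothetical hand-built copy of the first record only, and the filter encodes the question's request for long names for locality and county and the short name for state.

```r
library(dplyr)
library(tidyr)

# Toy rows in the shape of the joined table (values copied from record 1)
toy_long <- tibble::tribble(
  ~record, ~value,           ~name,        ~type,
  "1",     "Franconia",      "long_name",  "locality",
  "1",     "Grafton County", "long_name",  "administrative_area_level_2",
  "1",     "New Hampshire",  "long_name",  "administrative_area_level_1",
  "1",     "United States",  "long_name",  "country",
  "1",     "Franconia",      "short_name", "locality",
  "1",     "Grafton County", "short_name", "administrative_area_level_2",
  "1",     "NH",             "short_name", "administrative_area_level_1",
  "1",     "US",             "short_name", "country"
)

wide <- toy_long %>%
  # long names for locality and county, short name for state
  filter(
    (type %in% c("locality", "administrative_area_level_2") & name == "long_name") |
      (type == "administrative_area_level_1" & name == "short_name")
  ) %>%
  # one row per record, one column per component type
  pivot_wider(id_cols = record, names_from = type, values_from = value)

wide
```

This yields one row per record with columns locality, administrative_area_level_2, and administrative_area_level_1.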

In your example, you would want to filter value to

Upvotes: 0
