AlanJackson

Reputation: 31

Extracting a few items from a complex R deeply nested list

I am geocoding addresses with Google. Google tries very hard to return a lat/long, even when the address is bogus or flawed. I found a way to extract what I need from the geocode output, but I wonder whether there is a better way to do it, preferably one that outputs a data frame. The addresses come from a large police database that obviously has issues: there is no such thing as "Homeless" street, yet Google happily supplies a lat/long for a homeless shelter. Here is my reproducible code:

library(ggmap)

Addresses <- c("100  WEBSTER , Houston, TX",
               "1100  RUSK , Houston, TX",
               "700  HOMELESS , Houston, TX")

AllLatLong <- geocode(Addresses, output="all")

getfields <- function(x){c(
              x$results[[1]]$types[1],
              x$results[[1]]$geometry$location$lat,
              x$results[[1]]$geometry$location$lng
)}

latlongs <- lapply(AllLatLong, getfields)

latlongs
[[1]]
[1] "street_address" "29.7532733"     "-95.3814513"   

[[2]]
[1] "street_address" "29.7579245"     "-95.3628562"   

[[3]]
[1] "establishment" "29.7460122"    "-95.3663581"  

There is one problem with it that I know of: it is not robust. If the daily query limit is exceeded, it fails because the structure of that list element is completely different, so I guess I would need to add a test to the function that checks whether the results field has length zero.


Here is a different (better?) solution that also takes care of error conditions and outputs a data frame.

library(purrr)

getfields <- function(x){
          if(length(x$results)>0)  {data.frame(
          status=x$results[[1]]$types[1],
          Latitude=as.numeric(x$results[[1]]$geometry$location$lat),
          Longitude=as.numeric(x$results[[1]]$geometry$location$lng),
          stringsAsFactors = FALSE)
          } else{
            data.frame(status=NA,Latitude=NA,Longitude=NA)
          }
}

latlongs <- map_df(AllLatLong, getfields)
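To check the error branch without spending any query quota, the function can be run against hand-built list elements. This is only a sketch: mock_ok, mock_failed and latlongs_mock are made-up names, the shape of the failed element (an empty results list plus a status string) is an assumption based on the description above, and base R's do.call(rbind, ...) stands in for map_df() so that nothing beyond base R is needed.

```r
# Hypothetical stand-ins for two elements of geocode(..., output = "all"):
# one successful lookup and one whose `results` field is empty.
mock_ok <- list(
  results = list(list(
    types = c("street_address"),
    geometry = list(location = list(lat = 29.7532733, lng = -95.3814513))
  )),
  status = "OK"
)
mock_failed <- list(results = list(), status = "OVER_QUERY_LIMIT")

getfields <- function(x) {
  if (length(x$results) > 0) {
    first <- x$results[[1]]
    data.frame(
      status = as.character(first$types[1]),
      Latitude = as.numeric(first$geometry$location$lat),
      Longitude = as.numeric(first$geometry$location$lng),
      stringsAsFactors = FALSE
    )
  } else {
    # typed NAs keep the column types identical in both branches
    data.frame(
      status = NA_character_,
      Latitude = NA_real_,
      Longitude = NA_real_,
      stringsAsFactors = FALSE
    )
  }
}

# Row-bind the one-row data frames into a single data frame
latlongs_mock <- do.call(rbind, lapply(list(mock_ok, mock_failed), getfields))
latlongs_mock
```

The second row should come back as all NA while Latitude and Longitude stay numeric.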

Upvotes: 2

Views: 472

Answers (1)

Cristian E. Nuno

Reputation: 2920

Overview

Using lapply(), I applied a version of your getfields() function to each list within AllLatLong to keep only the vectors of interest. The result is stored in the list object filtered.geocode.results.

Afterwards, I collapsed each list into one data frame using data.frame(), do.call() and rbind(). Finally, I used cbind() to add the original input addresses as the first column in the newly created data frame df.

I don't know what the best way would be for you to compare the input.address column with the formatted.address column. If I were you, I would probably ask a new question on how to check whether the results from ggmap::geocode() actually reflect your original input addresses. For now, having the two columns side by side should help you check manually.
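As a rough first pass at that comparison, one option is to normalise both strings and test whether the street part of the input shows up in the formatted address. This is only a heuristic sketch, not a validated matching method: normalise_address() and street_matches() are hypothetical helper names, and it assumes the street part is everything before the first comma of the input.

```r
# Upper-case, turn punctuation into spaces, and squeeze whitespace.
# NA inputs are passed through as NA.
normalise_address <- function(x) {
  out <- rep(NA_character_, length(x))
  ok <- !is.na(x)
  cleaned <- gsub("[[:punct:]]", " ", toupper(x[ok]))
  out[ok] <- trimws(gsub("\\s+", " ", cleaned))
  out
}

# TRUE when the normalised street part of the input occurs verbatim in the
# normalised formatted address, FALSE when it does not, NA when there is
# no formatted address to compare against.
street_matches <- function(input, formatted) {
  street <- normalise_address(sub(",.*$", "", input))
  target <- normalise_address(formatted)
  found <- mapply(
    function(s, t) grepl(s, t, fixed = TRUE),
    street, target,
    USE.NAMES = FALSE
  )
  found[is.na(formatted)] <- NA
  found
}

input <- c("100  WEBSTER , Houston, TX",
           "700  HOMELESS , Houston, TX",
           "123, Houston, TX")
formatted <- c("100 Webster St, Houston, TX 77019, USA",
               "2000 Crawford St #700, Houston, TX 77002, USA",
               NA)
street_matches(input, formatted)
# [1]  TRUE FALSE    NA
```

Rows that come back FALSE (such as the "HOMELESS" address) are the ones worth inspecting by hand; abbreviations and misspellings in the source data will also show up as FALSE.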

Reproducible Example

# load necessary package
library( ggmap )

# create vector of addresses
Addresses <- 
  c( "100  WEBSTER , Houston, TX"
     , "1100  RUSK , Houston, TX"
     , "700  HOMELESS , Houston, TX"
     , "123, Houston, TX"
     , "Chicago, Houston, NY"
  )

# geocode Addresses
AllLatLong <- geocode( location = Addresses
                       , output = "all"
                       , source = "google"
)

# filter the returned list
# by desired vectors
filtered.geocode.results <-
  lapply( X = AllLatLong
          , FUN = function( i )
            # if the length of the results list within
            # each element of X is not zero,
            # concatenate the desired vectors
            # for that element
            if( length( i$results ) != 0 ){
              c(
                i$results[[1]]$formatted_address
                , i$results[[1]]$types[1]
                , i$results[[1]]$geometry$location$lat
                , i$results[[1]]$geometry$location$lng
              )
            } else{
              # if the length of the results list
              # is 0, return a vector
              # of four NAs
              c( rep( x = NA, times = 4 ) )
            }

  )

# collapse the lists
# into one data frame
df <-
  data.frame(
    do.call( what = rbind
             , args = filtered.geocode.results
    )
    , stringsAsFactors = FALSE
  )

# add original addresses
# back into data frame
df <- cbind( Addresses, df, stringsAsFactors = FALSE )

# assign column names
colnames( df ) <-
  c( "input.address", "formatted.address"
     , "address.type", "lat", "long" )

# check dim
dim( df ) # [1] 5 5

# view data frame
df
#                 input.address                             formatted.address   address.type
# 1  100  WEBSTER , Houston, TX        100 Webster St, Houston, TX 77019, USA street_address
# 2    1100  RUSK , Houston, TX          1100 Rusk St, Houston, TX 77002, USA street_address
# 3 700  HOMELESS , Houston, TX 2000 Crawford St #700, Houston, TX 77002, USA  establishment
# 4            123, Houston, TX                                          <NA>           <NA>
# 5        Chicago, Houston, NY                                          <NA>           <NA>
#          lat        long
# 1 29.7532733 -95.3814513
# 2 29.7579245 -95.3628562
# 3 29.7460122 -95.3663581
# 4       <NA>        <NA>
# 5       <NA>        <NA>

# end of script #
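One thing to watch: c() coerces its arguments to a common type, so combining the formatted address with the coordinates turns lat and long into character strings (which is why the missing values print as <NA>). A minimal sketch of converting them back follows; the small df here is a stand-in built from two rows of the printed output, not a fresh geocoding result.

```r
# Stand-in for the data frame built above, with character coordinates
df <- data.frame(
  input.address = c("100  WEBSTER , Houston, TX", "123, Houston, TX"),
  lat = c("29.7532733", NA),
  long = c("-95.3814513", NA),
  stringsAsFactors = FALSE
)

# Convert the coordinate columns to numeric; NA stays NA
df$lat <- as.numeric(df$lat)
df$long <- as.numeric(df$long)

str(df)
```

After the conversion the coordinates can be used directly for distance calculations or plotting.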

Upvotes: 1
