syre
syre

Reputation: 982

Extract unique US zipcodes (5 digits optionally followed by hyphen and 4 digits) from addresses

I'm looking for U.S. zipcodes in a df of addresses and postal codes. The records are unnormalised and contain data entry errors. Normal zipcodes are 5-digit numbers. I'm also looking for 4-digit numbers in the postal code field because the leading 0 may have been dropped by inadvertent conversion to a numeric value. 5-digit zipcodes can be followed by a hyphen and 4 digits, which I'd also like to capture. I expect only one zipcode in the postal code field, but it's possible there might be several in the address field because some people type two addresses in a form meant for only one. I want to look for zipcodes in addresses only if none is found in postal codes, because 5-digit street numbers in addresses would yield too many false positives. I realise addresses outside the U.S. will yield false positives if the local postal code matches the U.S. format.

My desired outcome is a new zip_code column of strings of unique found zipcodes separated by "/" (if multiple), with a leading 0 added to 4-digit zipcodes, or an empty string if nothing is found.

Here is my attempt, which totally doesn't work:

pc <- offices$postal_code[offices$postal_code != ""]
offices$zip_code[offices$postal_code != ""] <- unlist(lapply(pc, function(x) {
  pczc <- sub("(?<!\\d)(\\d{5}\\-\\d{4}|\\d{4,5})(?!\\d)", "\\2", x, perl = T)
  pczc <- ifelse(nchar(pczc) == 4, paste0("0", pczc), pczc)
  return(pczc)
  }))

ad <- offices$address[offices$zip_code == ""]
adzc <- regmatches(ad, gregexpr("(?<!\\d)(\\d{5}\\-\\d{4}|\\d{5})(?!\\d)", ad, perl = T))
offices$zip_code[offices$zip_code != ""] <- paste(unique(adzc), collapse = "/")

My regex don't work. Also, there may be a faster way of doing this (I have a lot of addresses). I've thought of merging the two fields first, but the problem with that approach is that 4-digit street numbers in the address field would be confused with zipcodes missing a leading 0, which can only occur in the postal code field. (There are also 5-digit street numbers, but I guess that can't be helped.)

Here's a sample of df offices (which doesn't include all possible use cases, e.g. either field may be empty):

structure(list(address = c("Headquarters 2355 E. Camelback Road Suite 300 Phoenix", 
"Headquarters 1401 Constitution Ave NW  Washington", "Headquarters 80 State Street 7th Floor  Albany", 
"Headquarters Spray Gaarde 46  Nieuwegein", "HQ 1055 Washington Blvd., 7th Floor  Stamford", 
"Headquarters Village Khubavali, PO Paud Taluka Mulshi  Pune", 
"Headquarters 231 Lagrange Street  Boston", "Headquarters 401 Chestnut St Suite 410 Chattanooga", 
"Israel Office st. ha Rav Bar Shaul 6  Rehovot", "Headquarters 7721 New Market Street  Olympia", 
"HQ Bernrieder Str. 15  Niederwinkling", "Headquarters 2810 Sydney Road  Plant City", 
"Headquarters 1350 Avenue of the Americas 9th Floor New York", 
"Headquarters Askanischer Platz 3  Berlin", "Australian Head Office Level 2, 145 Flinders Lane  Melbourne", 
"HQ 13303 Washington Avenue  Racine", "HQ 9150 E. Del Camino Dr., Ste 112  Scottsdale", 
"Arcadia Corporate Merchandise Ltd - Promotional Giveaways Grove Place, Wellington Road  High Wycombe", 
"Israel Office Shorashim, D.N.Misgav  ", "HQ 6009 Penn Avenue S.  Minneapolis"
), postal_code = c("85016", "20230", "12207", "3436", "6901", 
"412 108", "2132", "37402", "7625149", "98501", "94559", "33566-1173", 
"10019", "10963", "3000", "53406", "85258", "HP12 3PR", "20164", 
"55419")), .Names = c("address", "postal_code"), row.names = c(1L, 
2L, 4L, 5L, 6L, 8L, 10L, 11L, 12L, 14L, 15L, 18L, 19L, 21L, 22L, 
23L, 24L, 25L, 27L, 28L), class = "data.frame")

Upvotes: 0

Views: 1237

Answers (2)

dimitris_ps
dimitris_ps

Reputation: 5951

Instead of performing some complex regex you could use googlemapsapi. I am sure there is a R package, but the following code should help.

library(tidyr)
library(magrittr)
library(dplyr)
library(rvest)
library(jsonlite)
library(data.table)


getInfo <- function(data, address){

    mURL <- "http://maps.googleapis.com/maps/api/geocode/json?address=" %>% 
      paste0(unlist(data[address])) %>% gsub("Headquarters|HQ", "", .) %>% sapply(URLencode)

    temp <- lapply(mURL, function(y) {
                info <- read_html(y) %>% html_text %>% fromJSON(simplifyDataFrame = TRUE)
                if(length(info$results)){
                    info <- info[[1]]$address_components[[1]] %>% as.data.frame %>% select(-short_name)
                    info$types <- sapply(info$types, function(x) x[1])
                    info %<>% group_by(types) %>% summarize(long_name=toString(long_name)) %>% 
                             select(long_name, types) %>% ungroup
                    info %<>% spread(types, long_name)
                } else {
                  info <- data.frame(administrative_area_level_1=NA, administrative_area_level_2=NA, 
                                     country=NA, locality=NA, neighborhood=NA, postal_code=NA, 
                                     route=NA, street_number=NA, subpremise=NA)
                }
                info
      }) %>% rbindlist(fill=TRUE)

cbind(data, temp)

}

df2 <- getInfo(df, "address")

Test it out on a small subset of your data.frame which i assumed is named df. You can read more on the googlemapsapi

df2 %>% View

enter image description here

Upvotes: 1

hrbrmstr
hrbrmstr

Reputation: 78792

The following cleans up the postal_code column, but you didn't provide a data frame slice with zips in the address field so without knowing what the "real world" data looks like in that column it'd be a potential time sink w/o efficacy. I can add the zip extraction from the address field once you provide more representative data for your question.

library(stringi)
library(purrr)

df$zip_1 <- stri_trim_both(df$postal_code) %>%
  stri_match_last_regex("((?:[[:digit:]]{5}-[[:digit:]]{4})|(?:[[:digit:]]{4,5}))") %>%
  ifelse(nchar(.)==4, 0 %s+% ., .) %>%
  .[,2]

df[,2:3]
##    postal_code      zip_1
## 1        85016      85016
## 2        20230      20230
## 4        12207      12207
## 5         3436      03436
## 6         6901      06901
## 8      412 108       <NA>
## 10        2132      02132
## 11       37402      37402
## 12     7625149      76251
## 14       98501      98501
## 15       94559      94559
## 18  33566-1173 33566-1173
## 19       10019      10019
## 21       10963      10963
## 22        3000      03000
## 23       53406      53406
## 24       85258      85258
## 25    HP12 3PR       <NA>
## 27       20164      20164
## 28       55419      55419

Upvotes: 1

Related Questions