Reputation: 982
I'm looking for U.S. zipcodes in a df of addresses and postal codes. The records are unnormalised and contain data entry errors. Normal zipcodes are 5-digit numbers. I'm also looking for 4-digit numbers in the postal code field because the leading 0 may have been dropped by inadvertent conversion to a numeric value. 5-digit zipcodes can be followed by a hyphen and 4 digits, which I'd also like to capture. I expect only one zipcode in the postal code field, but it's possible there might be several in the address field because some people type two addresses in a form meant for only one. I want to look for zipcodes in addresses only if none is found in postal codes, because 5-digit street numbers in addresses would yield too many false positives. I realise addresses outside the U.S. will yield false positives if the local postal code matches the U.S. format.
My desired outcome is a new zip_code column of strings of unique found zipcodes separated by "/" (if multiple), with a leading 0 added to 4-digit zipcodes, or an empty string if nothing is found.
Here is my attempt, which totally doesn't work:
pc <- offices$postal_code[offices$postal_code != ""]
offices$zip_code[offices$postal_code != ""] <- unlist(lapply(pc, function(x) {
  pczc <- sub("(?<!\\d)(\\d{5}\\-\\d{4}|\\d{4,5})(?!\\d)", "\\2", x, perl = T)
  pczc <- ifelse(nchar(pczc) == 4, paste0("0", pczc), pczc)
  return(pczc)
}))
ad <- offices$address[offices$zip_code == ""]
adzc <- regmatches(ad, gregexpr("(?<!\\d)(\\d{5}\\-\\d{4}|\\d{5})(?!\\d)", ad, perl = T))
offices$zip_code[offices$zip_code != ""] <- paste(unique(adzc), collapse = "/")
My regexes don't work. Also, there may be a faster way of doing this (I have a lot of addresses). I've thought of merging the two fields first, but the problem with that approach is that 4-digit street numbers in the address field would be confused with zipcodes missing a leading 0, which can only occur in the postal code field. (There are also 5-digit street numbers, but I guess that can't be helped.)
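For reference, the extraction rules described above (try ZIP+4 first, then plain digits; allow a 4-digit match only in the postal code field and pad it back to 5) can be sketched in base R. This is a sketch of the question's own spec, not a fix for the attempt; `extract_zips` and the pattern names are hypothetical:

```r
# ZIP+4 must come first in the alternation, or "33566-1173" matches as "33566".
zip_pat_pc <- "(?<!\\d)(\\d{5}-\\d{4}|\\d{4,5})(?!\\d)"  # postal_code: 4 digits allowed
zip_pat_ad <- "(?<!\\d)(\\d{5}-\\d{4}|\\d{5})(?!\\d)"    # address: 5 digits only

extract_zips <- function(x, pattern) {
  m <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(m, function(z) {
    z <- unique(z)
    z[nchar(z) == 4] <- paste0("0", z[nchar(z) == 4])  # restore dropped leading 0
    paste(z, collapse = "/")  # "" when nothing matched, "a/b" when several
  }, character(1))
}
```

`extract_zips(offices$postal_code, zip_pat_pc)` fills the column in one vectorised pass; rows where it returns "" could then be retried against `address` with the stricter pattern.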
Here's a sample of df offices (which doesn't include all possible use cases, e.g. either field may be empty):
structure(list(address = c("Headquarters 2355 E. Camelback Road Suite 300 Phoenix",
"Headquarters 1401 Constitution Ave NW Washington", "Headquarters 80 State Street 7th Floor Albany",
"Headquarters Spray Gaarde 46 Nieuwegein", "HQ 1055 Washington Blvd., 7th Floor Stamford",
"Headquarters Village Khubavali, PO Paud Taluka Mulshi Pune",
"Headquarters 231 Lagrange Street Boston", "Headquarters 401 Chestnut St Suite 410 Chattanooga",
"Israel Office st. ha Rav Bar Shaul 6 Rehovot", "Headquarters 7721 New Market Street Olympia",
"HQ Bernrieder Str. 15 Niederwinkling", "Headquarters 2810 Sydney Road Plant City",
"Headquarters 1350 Avenue of the Americas 9th Floor New York",
"Headquarters Askanischer Platz 3 Berlin", "Australian Head Office Level 2, 145 Flinders Lane Melbourne",
"HQ 13303 Washington Avenue Racine", "HQ 9150 E. Del Camino Dr., Ste 112 Scottsdale",
"Arcadia Corporate Merchandise Ltd - Promotional Giveaways Grove Place, Wellington Road High Wycombe",
"Israel Office Shorashim, D.N.Misgav ", "HQ 6009 Penn Avenue S. Minneapolis"
), postal_code = c("85016", "20230", "12207", "3436", "6901",
"412 108", "2132", "37402", "7625149", "98501", "94559", "33566-1173",
"10019", "10963", "3000", "53406", "85258", "HP12 3PR", "20164",
"55419")), .Names = c("address", "postal_code"), row.names = c(1L,
2L, 4L, 5L, 6L, 8L, 10L, 11L, 12L, 14L, 15L, 18L, 19L, 21L, 22L,
23L, 24L, 25L, 27L, 28L), class = "data.frame")
Upvotes: 0
Views: 1237
Reputation: 5951
Instead of performing some complex regex, you could use the Google Maps Geocoding API. I am sure there is an R package for it, but the following code should help.
library(tidyr)
library(magrittr)
library(dplyr)
library(rvest)
library(jsonlite)
library(data.table)
getInfo <- function(data, address){
  # Build one geocoding request URL per address, stripping the
  # "Headquarters"/"HQ" prefixes before URL-encoding.
  # NB: Google now requires an API key (and https) for this endpoint.
  mURL <- "http://maps.googleapis.com/maps/api/geocode/json?address=" %>%
    paste0(unlist(data[address])) %>% gsub("Headquarters|HQ", "", .) %>% sapply(URLencode)
  temp <- lapply(mURL, function(y) {
    # Fetch and parse the JSON response
    info <- read_html(y) %>% html_text %>% fromJSON(simplifyDataFrame = TRUE)
    if(length(info$results)){
      # Keep the long_name of each address component of the first result
      info <- info[[1]]$address_components[[1]] %>% as.data.frame %>% select(-short_name)
      info$types <- sapply(info$types, function(x) x[1])
      # Collapse duplicate component types, then reshape to one column per type
      info %<>% group_by(types) %>% summarize(long_name=toString(long_name)) %>%
        select(long_name, types) %>% ungroup
      info %<>% spread(types, long_name)
    } else {
      # No geocoding result: return a row of NAs with the expected columns
      info <- data.frame(administrative_area_level_1=NA, administrative_area_level_2=NA,
                         country=NA, locality=NA, neighborhood=NA, postal_code=NA,
                         route=NA, street_number=NA, subpremise=NA)
    }
    info
  }) %>% rbindlist(fill=TRUE)
  cbind(data, temp)
}
df2 <- getInfo(df, "address")
Test it out on a small subset of your data.frame, which I assumed is named df.
You can read more in the Google Maps Geocoding API documentation.
df2 %>% View
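If you only need the zip rather than every component, the parsed response's address_components (a long_name plus a types vector per component, the same shape getInfo consumes above) can be filtered with a small base-R helper. pick_component and the mock data below are hypothetical:

```r
# Return the first long_name whose types vector contains `type`, else NA.
pick_component <- function(components, type) {
  hit <- vapply(components$types, function(t) type %in% t, logical(1))
  if (any(hit)) components$long_name[hit][1] else NA_character_
}

# Mock of a parsed address_components block (shape assumed from getInfo above)
components <- list(
  long_name = c("300", "Phoenix", "85016"),
  types     = list("street_number", c("locality", "political"), "postal_code")
)

pick_component(components, "postal_code")  # "85016"
```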
Upvotes: 1
Reputation: 78792
The following cleans up the postal_code column, but you didn't provide a data frame slice with zips in the address field, so without knowing what the "real world" data looks like in that column it'd be a potential time sink without efficacy. I can add the zip extraction from the address field once you provide more representative data for your question.
library(stringi)
library(purrr)
df$zip_1 <- stri_trim_both(df$postal_code) %>%
  stri_match_last_regex("((?:[[:digit:]]{5}-[[:digit:]]{4})|(?:[[:digit:]]{4,5}))") %>%  # ZIP+4 or bare 4-5 digits
  ifelse(nchar(.)==4, 0 %s+% ., .) %>%  # restore the dropped leading 0
  .[,2]  # keep the capture group, not the full match
df[,2:3]
## postal_code zip_1
## 1 85016 85016
## 2 20230 20230
## 4 12207 12207
## 5 3436 03436
## 6 6901 06901
## 8 412 108 <NA>
## 10 2132 02132
## 11 37402 37402
## 12 7625149 76251
## 14 98501 98501
## 15 94559 94559
## 18 33566-1173 33566-1173
## 19 10019 10019
## 21 10963 10963
## 22 3000 03000
## 23 53406 53406
## 24 85258 85258
## 25 HP12 3PR <NA>
## 27 20164 20164
## 28 55419 55419
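The address-field fallback the question asks for could then be layered on top: only rows where zip_1 came back NA (or empty) get the stricter 5-digit-only search, and multiple unique hits are joined with "/". A base-R sketch, with fill_from_address as a hypothetical helper (the sample data has no zips in address, so the examples are made up):

```r
# Fall back to the address field only where no zip was found in postal_code.
fill_from_address <- function(zip, address) {
  pat <- "(?<!\\d)(\\d{5}-\\d{4}|\\d{5})(?!\\d)"  # no 4-digit form here: street numbers
  hits <- regmatches(address, gregexpr(pat, address, perl = TRUE))
  found <- vapply(hits, function(z) paste(unique(z), collapse = "/"), character(1))
  ifelse(is.na(zip) | zip == "", found, zip)
}
```

Applied to the result above, that would be `df$zip_code <- fill_from_address(df$zip_1, df$address)`. Note that 5-digit street numbers will still match, which the question accepts as unavoidable.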
Upvotes: 1