Regex to extract US zip codes but not faux codes

Question

Using the XML package and XPath to scrape addresses from websites, I sometimes can get only a string that has embedded in it the zip code I want. It is straightforward to extract the zip code, but sometimes there are other five-digit strings that show up.

Here are some variations on the problem in a data frame.

zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345"))

The R statement to extract zip codes (both 5 digit and plus 4 digits) is below, but it is tricked by the faux zip codes of the street number and the suite number (and there may be other possibilities in other address strings).

regmatches(zips$address, gregexpr("\\d{5}([-]?\\d{4})?", zips$address, perl = TRUE))

An answer to a previous SO question ("Extracting a zip code from an address string") suggested that a "regex will return the last consecutive five digit string. It uses a negative look-ahead to ensure the absence of 5-digit strings after the one being returned."

\b\d{5}\b(?!.*\b\d{5}\b)

But that question and answer deal with PHP and offer an if statement with preg_match(). I am not familiar with that language and its tools, but the idea might be right.

My question: what R code will find real zip codes and ignore false lookalikes?

rawr · Accepted Answer

This is my first regex answer (I am still learning) so hopefully I don't say anything wrong to lead you in the wrong direction.

Basically, this regex looks for, as you hinted in your question, the last string that looks like a zip code, that is, one that is not followed by another string that looks like a zip code.

The basic syntax is pattern(?!.*pattern), which says to match pattern only if it is not followed by anything (.*) and then pattern again. The (?! ) construct is a negative look-ahead assertion.
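As a small illustration of the pattern(?!.*pattern) idea on its own, here is a sketch using a made-up string and a three-digit pattern (both are assumptions for the example, not part of the original question). The look-ahead rejects every run of three digits that has another such run somewhere after it, so only the last one survives.

```r
# Toy input (hypothetical): three runs of three digits.
x <- "abc 111 def 222 ghi 333"

# Without the look-ahead, every run is returned.
all_runs <- regmatches(x, gregexpr("[0-9]{3}", x, perl = TRUE))[[1]]

# With the negative look-ahead, a run matches only if no further
# three-digit run appears later in the string.
last_only <- regmatches(x, regexpr("[0-9]{3}(?!.*[0-9]{3})", x, perl = TRUE))

all_runs   # "111" "222" "333"
last_only  # "333"
```

Note that perl = TRUE is needed here, because look-ahead assertions are not supported by R's default regex engine.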

So we can replace pattern with what you are interested in finding:

[0-9]{5}(-[0-9]{4})?

That is, a digit string [0-9] of exactly 5 characters {5}, which may optionally (?) be followed by another group defined as a hyphen and another digit string of length four, (-[0-9]{4}).
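To see what this base pattern does before any look-ahead is added, here is a short sketch on three shortened address strings (the strings and the variable names zip_pattern and found are assumptions for the example). It picks up both the 5-digit and the plus-4 forms, but it also returns the street number, which is the problem the look-ahead is meant to solve.

```r
# Base pattern: five digits, optionally followed by a hyphen and four digits.
zip_pattern <- "[0-9]{5}(-[0-9]{4})?"

# Shortened, hypothetical address strings.
x <- c("City, ST 12345",
       "City ST 12345-0000",
       "18540 Main Ave., City, ST 12345")

found <- regmatches(x, gregexpr(zip_pattern, x, perl = TRUE))

found[[1]]  # "12345"
found[[2]]  # "12345-0000"
found[[3]]  # "18540" "12345"  (the street number is matched too)
```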

Putting it all together, with gregexpr to search for the matches and regmatches to extract them, I get:

zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345")) 
regmatches(zips$address,
           gregexpr('[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)', zips$address, perl = TRUE))

# [[1]]
# [1] "12345"
# 
# [[2]]
# [1] "12345-0000"
# 
# [[3]]
# [1] "12345"
# 
# [[4]]
# [1] "12345"
# 
# [[5]]
# [1] "12345"
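One possible refinement, offered only as a sketch: the pattern above would also match five digits inside a longer run of digits (a phone number, say), and regmatches with gregexpr returns a list rather than a plain vector. The hypothetical helper below, extract_zip (my name, not from the question), adds the \b word boundaries from the PHP answer quoted in the question and returns one value per address, with NA where nothing matches. It still assumes the real zip code is the last five-digit string in the address, so something like "ST 12345 Suite 18540" would return the wrong value.

```r
# Hypothetical helper: return the last zip-like token per address, or NA.
extract_zip <- function(address) {
  address <- as.character(address)  # also handles factor columns
  # \\b keeps the match from starting or ending inside a longer digit run;
  # the look-ahead rejects a candidate if another 5-digit token follows it.
  pattern <- "\\b[0-9]{5}(-[0-9]{4})?\\b(?!.*\\b[0-9]{5}\\b)"
  m <- regexpr(pattern, address, perl = TRUE)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(address))
  out[hit] <- regmatches(address, m)  # regmatches returns matched elements only
  out
}

res <- extract_zip(c("Company, One Main Ave Suite 18540, City, ST 12345",
                     "Company 18540 Main Ave. City ST 12345-0000",
                     "no zip here",
                     "Call 5551234567"))
res  # "12345" "12345-0000" NA NA
```

Because the result is a character vector of the same length as the input, it can be assigned straight to a new column, for example zips$zip <- extract_zip(zips$address).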
