Reputation: 7664
Using the XML package and XPath to scrape addresses from websites, I sometimes can get only a string that has embedded in it the zip code I want. It is straightforward to extract the zip code, but sometimes there are other five-digit strings that show up.
Here are some variations on the problem, in a data frame.
zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345"))
The R statement to extract zip codes (both 5 digit and plus 4 digits) is below, but it is tricked by the faux zip codes of the street number and the suite number (and there may be other possibilities in other address strings).
regmatches(zips$address, gregexpr("\\d{5}([-]?\\d{4})?", zips$address, perl = TRUE))
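To make the problem concrete, here is a minimal sketch of what that statement returns for the sample data. The name `naive` is only illustrative, and `stringsAsFactors = FALSE` is added on the assumption that an older R version might otherwise turn the addresses into a factor.

```r
zips <- data.frame(
  id = seq(1, 5),
  address = c("Company, 18540 Main Ave., City, ST 12345",
              "Company 18540 Main Ave. City ST 12345-0000",
              "Company 18540 Main Ave. City State 12345",
              "Company, 18540 Main Ave., City, ST 12345 USA",
              "Company, One Main Ave Suite 18540, City, ST 12345"),
  stringsAsFactors = FALSE
)

# Every run of five digits (optionally followed by -dddd) is returned,
# so the street or suite number comes back alongside the real zip code
naive <- regmatches(zips$address,
                    gregexpr("\\d{5}([-]?\\d{4})?", zips$address, perl = TRUE))

lengths(naive)   # two candidates per address instead of one
naive[[2]]       # "18540" "12345-0000"
```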
An answer to a previous SO question suggested that a "regex will return the last consecutive five digit string. It uses a negative look-ahead to ensure the absence of 5-digit strings after the one being returned."
Extracting a zip code from an address string
\b\d{5}\b(?!.*\b\d{5}\b)
But that question and answer deal with PHP and offer an if statement with preg_match(). I am not familiar with that language or its tools, but the idea might be right.
My question: what R code will find real zip codes and ignore false lookalikes?
Upvotes: 5
Views: 2874
Reputation: 109874
The qdapRegex package has the rm_zip function for this:
library(qdapRegex)

zips <- data.frame(id = seq(1, 5),
    address = c("Company, 18540 Main Ave., City, ST 12345",
        "Company 18540 Main Ave. City ST 12345-0000",
        "Company 18540 Main Ave. City State 12345",
        "Company, 18540 Main Ave., City, ST 12345 USA",
        "Company, One Main Ave Suite 18540, City, ST 12345")
)

## rm_zip returns every candidate; keep only the last one per address
lapply(rm_zip(zips$address, extract=TRUE), tail, 1)
## [[1]]
## [1] "12345"
##
## [[2]]
## [1] "12345-0000"
##
## [[3]]
## [1] "12345"
##
## [[4]]
## [1] "12345"
##
## [[5]]
## [1] "12345"
EDIT Per @lawyeR's comments:
I think that you want some regex that is more specific than the dictionary system used by qdapRegex. The current implementation of rm_zip allows for validation purposes and thus I wouldn't alter the regular expression it uses to be more flexible. I also wouldn't alter the function rm_zip to have additional parameters/arguments as qdapRegex attempts to have consistently operating functions.
That being said, you could create your own function using the rm_ function and supply your own regular expression. I have done this using both of the parameters specified in your comment:
More complex data set:
zips <- data.frame(id = seq(1, 6),
address = c("Company, 18540 Main Ave., City, ST 12345",
"Company 18540 Main Ave. City ST 12345-0000",
"Company 18540 Main Ave. City State 12345",
"Company, 18540 Main Ave., City, ST 12345 USA",
"Company, One Main Ave Suite 18540m, City, ST 12345",
"company 12345678")
)
Function to grab a zip even if a non-digit character follows it:
## paste together a more flexible regular expression
pat <- pastex(
"@rm_zip",
"(?<!\\d)\\d{5}(?!\\d)",
"(?<!\\d)\\d{5}-\\d{4}(?!\\d)"
)
## Create your own function with extract set to TRUE
rm_zip2 <- rm_(pattern=pat, extract=TRUE)
rm_zip2(zips$address)
## [[1]]
## [1] "18540" "12345"
##
## [[2]]
## [1] "18540" "12345-0000"
##
## [[3]]
## [1] "18540" "12345"
##
## [[4]]
## [1] "18540" "12345"
##
## [[5]]
## [1] "18540" "12345"
##
## [[6]]
## [1] NA
Function to extract just 5 digit zips
rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract=TRUE)
rm_zip3(zips$address)
## [[1]]
## [1] "18540" "12345"
##
## [[2]]
## [1] "18540" "12345"
##
## [[3]]
## [1] "18540" "12345"
##
## [[4]]
## [1] "18540" "12345"
##
## [[5]]
## [1] "18540" "12345"
##
## [[6]]
## [1] NA
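If you would rather not depend on qdapRegex, the same digit-boundary lookarounds also work in base R. The following is only a sketch: `zip_pat`, `candidates` and `last_zip` are illustrative names, and the pattern is a hand-written combination of the two expressions above, not the one rm_zip uses internally.

```r
addresses <- c("Company, One Main Ave Suite 18540m, City, ST 12345",
               "Company 18540 Main Ave. City ST 12345-0000",
               "company 12345678")

# Five digits, optionally followed by -dddd, with no digit on either side
zip_pat <- "(?<!\\d)\\d{5}(?:-\\d{4})?(?!\\d)"

# All candidates per address (street and suite numbers included)
candidates <- regmatches(addresses, gregexpr(zip_pat, addresses, perl = TRUE))

# Keep the last candidate of each address; NA when nothing matched
last_zip <- vapply(candidates,
                   function(m) if (length(m)) m[length(m)] else NA_character_,
                   character(1))
last_zip
```

Taking the last candidate relies on the assumption that the zip code is the final five-digit group in the address, which holds for the sample data but may not hold for every scraped string.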
Upvotes: 1
Reputation: 20811
This is my first regex answer (I am still learning) so hopefully I don't say anything wrong to lead you in the wrong direction.
Basically, this regex looks for, as you hinted in your question, the last string that looks like a zip code, that is, one which is not followed by another string that looks like a zip code.
The basic syntax is pattern(?!.*pattern), which says to match pattern only if it is not followed (a negative look-ahead assertion, syntax: (?! )) by anything (.*) and then pattern again.
So we can replace pattern with what you are interested in finding:
[0-9]{5}(-[0-9]{4})?
That is, a digit string [0-9] of exactly 5 characters {5}, which may optionally (?) be followed by another group defined as a hyphen and another digit string of length four: (-[0-9]{4}).
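To see the pattern(?!.*pattern) idiom in isolation, here is a small sketch with a toy pattern. The string `x` and the names `all_hits` and `last_hit` are made up for illustration.

```r
x <- "aa bb aa cc aa"

# Plain pattern: every occurrence of "aa" is matched
all_hits <- gregexpr("aa", x, perl = TRUE)[[1]]

# pattern(?!.*pattern): only an "aa" with no later "aa" qualifies
last_hit <- gregexpr("aa(?!.*aa)", x, perl = TRUE)[[1]]

as.integer(all_hits)   # start positions 1, 7 and 13
as.integer(last_hit)   # start position 13 only
```

Note that perl = TRUE is needed here, because R's default regex engine does not support look-ahead assertions.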
Putting it all together with gregexpr to search for the matches and regmatches to extract them, I get:
zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345"))
regmatches(zips$address,
gregexpr('[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)', zips$address, perl = TRUE))
# [[1]]
# [1] "12345"
#
# [[2]]
# [1] "12345-0000"
#
# [[3]]
# [1] "12345"
#
# [[4]]
# [1] "12345"
#
# [[5]]
# [1] "12345"
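Because the look-ahead leaves at most one match per string, the result can also be collected into a plain character vector and stored as a column. This is a sketch under that assumption: `extract_zip` is a hypothetical helper name, and it uses regexpr (first match only) rather than gregexpr.

```r
extract_zip <- function(x) {
  pat <- "[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)"
  m <- regexpr(pat, x, perl = TRUE)       # -1 where nothing matched
  out <- rep(NA_character_, length(x))    # default to NA for non-matches
  out[m != -1] <- regmatches(x, m)        # fill in the matched strings
  out
}

zips <- data.frame(
  id = seq(1, 5),
  address = c("Company, 18540 Main Ave., City, ST 12345",
              "Company 18540 Main Ave. City ST 12345-0000",
              "Company 18540 Main Ave. City State 12345",
              "Company, 18540 Main Ave., City, ST 12345 USA",
              "Company, One Main Ave Suite 18540, City, ST 12345"),
  stringsAsFactors = FALSE
)

zips$zip <- extract_zip(zips$address)
zips$zip
```

Addresses without any five-digit run come back as NA, so rows that need manual review are easy to pick out with is.na(zips$zip).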
Upvotes: 4