R: use grep to find one or several matches in order of importance

Question

I am using grep to tidy up some address data. My goal here is to identify the street type (street, avenue, road, etc.) in a given record and column, which has already been split on spaces into individual words in the variable tempval, for example:

R > tempval
[1] "38"     "WILLOW" "PARK"

I use the following statement to find the position of the word that follows the street name (the street type):

  stID <- grep("STREET|\\bST\\b|AVENUE|\\bAVE\\b|\\bAV\\b|WAY|BOULEVARD|\\bBD\\b|ROAD|\\bRD\\b|PLACE|\\bPL\\b|ESPLANADE|TERRACE|PARADE|DRIVE|\\bDR\\b|\\bPARK\\b|LANE|CRESCENT|\\bCOURT\\b|\\bCRES\\b", tempval, ignore.case = TRUE)

R > stID
[1] 3

This is fine: I know "PARK" is the 3rd element, and what comes before it is my street number and name.

However, a problem arises when there are several matches, so that length(stID) > 1, for example:

R > tempval
[1] "38"   "PARK" "ST"

So here, I get

R > stID
[1] 2 3

How do I get R to return only one match, in order of importance (the order in which I have placed the strings in the grep pattern)? In other words, if R finds both "ST" and "PARK", "ST" is more important than "PARK", so only stID = 3 should be returned.

Joris Meys · Accepted Answer

Using grep here is risky: patterns such as STREET are not anchored with word boundaries, so your grep would, even if it took the priority into account, treat "streetlife" as the street type in "streetlife Park" (it finds "street" inside "streetlife").
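To make the false positive concrete, here is a minimal sketch using a shortened, two-alternative version of the pattern (the shortening is only for illustration):

```r
tempval <- c("38", "streetlife", "Park")

# "STREET" has no word boundaries, so it matches inside "streetlife"
hits <- grep("STREET|\\bPARK\\b", tempval, ignore.case = TRUE)
hits
# [1] 2 3

# Anchoring the alternative with \\b removes the false positive
hits_anchored <- grep("\\bSTREET\\b|\\bPARK\\b", tempval, ignore.case = TRUE)
hits_anchored
# [1] 3
```

With the unanchored pattern, a priority rule that prefers STREET over PARK would pick element 2, which is the street name, not the street type.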

Hence I suggest you use match instead. Convert everything to lower case and build a vector of street types in order of importance. match then gives, for each type, the position in x where it occurs (or NA if it does not occur). The first value that is not NA is the position you are looking for:

checkstreet <- function(x){
  x <- tolower(x)
  thenames <- c("street","st","avenue","ave","av",
                "way","boulevard", "bd", "road", "rd",
                "place", "pl", "esplanade","terrace","parade",
                "drive","dr","park","lane","crescent","court",
                "cres")

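  # for each name, in order of importance: its position in x, or NA if absent
  id <- match(thenames, x)
  # the first non-NA entry belongs to the most important name found
  id[!is.na(id)][1]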
}

gives:

> tmpval <- c("38","park","street")
> checkstreet(tmpval)
[1] 3
> tmpval <- c("44","Average","Esplanade")
> checkstreet(tmpval)
[1] 3
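
As a sketch of how the returned position might then be used to separate the street number and name from the street type: split_address below is a hypothetical helper, not something from the question, and it assumes that a record without any known street type should be kept whole.

```r
street_types <- c("street", "st", "avenue", "ave", "av",
                  "way", "boulevard", "bd", "road", "rd",
                  "place", "pl", "esplanade", "terrace", "parade",
                  "drive", "dr", "park", "lane", "crescent", "court",
                  "cres")

# Same logic as above: position of the most important street type, or NA
checkstreet <- function(x) {
  id <- match(street_types, tolower(x))
  id[!is.na(id)][1]
}

# Hypothetical helper: everything before the street type becomes the name
split_address <- function(x) {
  pos <- checkstreet(x)
  if (is.na(pos)) {
    # assumption: no known street type, so keep the whole record as the name
    return(list(name = paste(x, collapse = " "), type = NA_character_))
  }
  list(name = paste(x[seq_len(pos - 1)], collapse = " "), type = x[pos])
}

res <- split_address(c("38", "PARK", "ST"))
res$name  # "38 PARK"
res$type  # "ST"
```

Note that checkstreet returns NA when nothing matches, so any caller has to handle that case explicitly.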

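If you insist on using grep and want to keep the \b word boundaries, you can apply the same logic: grep for each name in order of importance and take the first hit:

checkstreet <- function(x){
  x <- tolower(x)
  thenames <- c("street","st","avenue","ave","av",
                "way","boulevard", "bd", "road", "rd",
                "place", "pl", "esplanade","terrace","parade",
                "drive","dr","park","lane","crescent","court",
                "cres")

  # wrap every name in word boundaries; note the doubled backslash in R strings
  patterns <- paste0("\\b", thenames, "\\b")
  # hits[[i]] holds the positions in x that match the i-th most important name
  hits <- lapply(patterns, grep, x = x)
  # first hit in order of importance (NA if nothing matches)
  unlist(hits)[1]
}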
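
One case where word-boundary patterns and exact matching can give different answers is trailing punctuation. The sketch below uses a shortened list of street types and two hypothetical function names, by_match and by_grep, purely for illustration:

```r
street_types <- c("street", "st", "park")  # shortened list, illustration only

# exact matching on whole words
by_match <- function(x) {
  id <- match(street_types, tolower(x))
  id[!is.na(id)][1]
}

# word-boundary regular expressions, tried in order of importance
by_grep <- function(x) {
  patterns <- paste0("\\b", street_types, "\\b")
  hits <- lapply(patterns, grep, x = tolower(x))
  unlist(hits)[1]
}

x <- c("38", "Park", "St.")
by_match(x)  # 2: "st." is not exactly "st", so only "park" is found
by_grep(x)   # 3: "\\bst\\b" matches inside "st."
```

Which behaviour is preferable depends on how clean the data is; stripping punctuation before calling match would be another option.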
