Neodyme

Reputation: 557

R: use grep to find one or several matches in order of importance

I am using grep to tidy up some address data. My goal here is to identify the street / avenue / road name etc. in a given record and column. The record has already been split on spaces into individual words, stored in the variable tempval, for example:

R > tempval
[1] "38"   "WILLOW" "PARK"  

I use the following statement to find where one of the words that follow the street name might be:

  stID <- grep("STREET|\\bST\\b|AVENUE|\\bAVE\\b|\\bAV\\b|WAY|BOULEVARD|\\bBD\\b|ROAD|\\bRD\\b|PLACE|\\bPL\\b|ESPLANADE|TERRACE|PARADE|DRIVE|\\bDR\\b|\\bPARK\\b|LANE|CRESCENT|\\bCOURT\\b|\\bCRES\\b", tempval, ignore.case = TRUE)

R > stID
[1] 3

This is fine, I know "PARK" is the 3rd element and what comes before that will be my street number and name.

However, a problem arises when there are several matches, so that length(stID) > 1, for example:

R > tempval
[1] "38"   "PARK" "ST" 

So here, I get

R > stID
[1] 2 3

How do I get R to return only one match, in order of importance (the order in which I have placed the strings in the pattern of grep)? In other words, if R finds both "ST" and "PARK", "ST" is more important than "PARK" thus return stID = 3 only?

Upvotes: 1

Views: 2044

Answers (2)

Joris Meys

Reputation: 108543

Using grep is dangerous here: even if it took the priority into account, it would pick "streetlife" as the match when applied to "streetlife Park", because it finds "street" inside "streetlife".

Hence I suggest you use match instead. Convert everything to lower case and use a vector with the values in order of importance. match then tells you, for each value in that vector, at what position in x it occurs (NA if it does not occur). The first value that is not NA is the one you want:
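A quick illustration of that pitfall, as a minimal sketch on a made-up two-word input (the names `words`, `loose` and `strict` are hypothetical, not from the question):

```r
# Hypothetical input: "streetlife" is part of the street name, not a street type.
words <- c("streetlife", "Park")

# Without word boundaries, "STREET" matches inside "streetlife".
loose <- grep("STREET", words, ignore.case = TRUE)

# With \\b on both sides, only a whole-word "street" would match.
strict <- grep("\\bSTREET\\b", words, ignore.case = TRUE)

loose   # 1
strict  # integer(0)
```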

checkstreet <- function(x){
  x <- tolower(x)
  thenames <- c("street","st","avenue","ave","av",
                "way","boulevard", "bd", "road", "rd",
                "place", "pl", "esplanade","terrace","parade",
                "drive","dr","park","lane","crescent","court",
                "cres")

  id <- match(thenames, x)
  id[!is.na(id)][1]
}

gives:

> tmpval <- c("38","park","street")
> checkstreet(tmpval)
[1] 3
> tmpval <- c("44","Average","Esplanade")
> checkstreet(tmpval)
[1] 3
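To see why this respects the priority order, it may help to look at the intermediate result of match. The sketch below uses a shortened priority list (an assumption made for brevity; the full list is the one in checkstreet) on the "38 PARK ST" example from the question:

```r
x <- tolower(c("38", "PARK", "ST"))
# Shortened priority list for illustration only.
thenames <- c("street", "st", "avenue", "park")

# One entry per name, in priority order: position of that name in x, or NA.
id <- match(thenames, x)
id                 # NA  3 NA  2
id[!is.na(id)][1]  # 3
```

Because id follows the order of thenames rather than the order of x, the first non-NA entry belongs to the most important name that is present, here "st" at position 3.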

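If you insist on using grep and want to keep \\b for your word boundaries, you can use the same logic, but this time with which.min. For each word, find the rank of the most important pattern it matches, then take the word with the lowest rank:

checkstreet <- function(x){
  x <- tolower(x)
  thenames <- c("street","st","avenue","ave","av",
                "way","boulevard", "bd", "road", "rd",
                "place", "pl", "esplanade","terrace","parade",
                "drive","dr","park","lane","crescent","court",
                "cres")
  # Wrap each name in \\b so that only whole words match
  patterns <- paste0("\\b", thenames, "\\b")

  # For each word: rank of the highest-priority pattern it matches (NA if none)
  rank <- vapply(x, function(w) which(vapply(patterns, grepl, logical(1), x = w))[1],
                 integer(1), USE.NAMES = FALSE)
  # Index of the word with the best (lowest) rank; integer(0) if nothing matches
  which.min(rank)
}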

Upvotes: 3

ikop

Reputation: 1790

You could do it by matching each of the search words individually in a loop and then scoring the match, giving a higher score to matches that are placed earlier in your search list:

## Vector of search terms:
matchVec <- strsplit("STREET|\\bST\\b|AVENUE|\\bAVE\\b|\\bAV\\b|WAY|BOULEVARD|\\bBD\\b|ROAD|\\bRD\\b|PLACE|\\bPL\\b|ESPLANADE|TERRACE|PARADE|DRIVE|\\bDR\\b|\\bPARK\\b|LANE|CRESCENT|\\bCOURT\\b|\\bCRES\\b", "\\|")[[1]]

## Function to determine score of the match:
scoreMatch <- function(myString, matchVec){
    ## Position of matches in the search list:
    position <- which(vapply(matchVec, function(matchStr) grepl(pattern = matchStr, x = myString),
                    logical(1)))
    ## Score: First search term gets the highest score, second gets second
    ## highest score etc. No match = score 0. If several terms match, the
    ## most important one (position[[1]]) determines the score:
    if (length(position) > 0) length(matchVec) - position[[1]] + 1 else 0
}

## Determine score of each element/word in your vector:
scoreVec <- vapply(tempval, function(x) scoreMatch(x, matchVec), numeric(1))

## Find index with the highest score:
stID <- which.max(scoreVec)
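As a worked example of the scoring idea on the "38 PARK ST" case from the question, here is a self-contained sketch. The pattern list is shortened and the names `patterns`, `words` and `score_word` are hypothetical stand-ins for matchVec, tempval and scoreMatch:

```r
# Shortened, hypothetical pattern list in priority order (highest first).
patterns <- c("STREET", "\\bST\\b", "\\bPARK\\b")
words <- c("38", "PARK", "ST")

# Score = number of patterns minus the rank of the first matching pattern, plus 1;
# a word that matches nothing scores 0.
score_word <- function(word, patterns) {
  hits <- which(vapply(patterns, grepl, logical(1), x = word))
  if (length(hits) == 0) 0 else length(patterns) - hits[[1]] + 1
}

scores <- vapply(words, score_word, numeric(1), patterns = patterns,
                 USE.NAMES = FALSE)
scores             # 0 1 2
which.max(scores)  # 3
```

Note that if no word matches at all, every score is 0 and which.max still returns 1, so it is probably worth checking max(scores) > 0 before trusting the index.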

Upvotes: 1
