Reputation: 557
I am using grep to tidy up some address data. My goal here is to identify the street / avenue / road name etc. in a given record and column, which has already been split on spaces into individual words in the variable tempval, for example:
R > tempval
[1] "38" "WILLOW" "PARK"
I use the following statement to spot where one of the words that follow the street name might be:
stID <- grep("STREET|\\bST\\b|AVENUE|\\bAVE\\b|\\bAV\\b|WAY|BOULEVARD|\\bBD\\b|ROAD|\\bRD\\b|PLACE|\\bPL\\b|ESPLANADE|TERRACE|PARADE|DRIVE|\\bDR\\b|\\bPARK\\b|LANE|CRESCENT|\\bCOURT\\b|\\bCRES\\b", tempval, ignore.case = TRUE)
R > stID
[1] 3
This is fine: I know "PARK" is the 3rd element, and what comes before it will be my street number and name.
However, a problem arises when there are several matches, so that length(stID) > 1, for example:
R > tempval
[1] "38" "PARK" "ST"
So here, I get
R > stID
[1] 2 3
How do I get R to return only one match, in order of importance (the order in which I have placed the strings in the pattern of grep)? In other words, if R finds both "ST" and "PARK", "ST" is more important than "PARK", so it should return only stID = 3.
Upvotes: 1
Views: 2044
Reputation: 108543
Using grep is very dangerous: even if it took the priority into account, your grep would pick "streetlife" as the street word when run on "streetlife Park" (it would find "street" in "streetlife").
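The partial-match problem is easy to reproduce. The snippet below is a small sketch with a cut-down pattern and a made-up input vector; it only illustrates the difference that \\b makes:

```r
# Made-up input: "streetlife" is part of the street name, not a street type
words <- c("streetlife", "Park")

# Without word boundaries, "STREET" also matches inside "streetlife"
loose <- grep("STREET|\\bPARK\\b", words, ignore.case = TRUE)

# With \\b on both sides, only the whole word "Park" matches
strict <- grep("\\bSTREET\\b|\\bPARK\\b", words, ignore.case = TRUE)
```

Here loose contains both positions, while strict contains only the position of "Park".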
Hence I suggest you use match instead. Convert everything to lower case and use a vector with the values in order of importance. Then you can use match to see at what position in x each value of that vector is found. Now you only have to look for the first value that is not NA and you're done:
checkstreet <- function(x){
  x <- tolower(x)
  thenames <- c("street","st","avenue","ave","av",
                "way","boulevard", "bd", "road", "rd",
                "place", "pl", "esplanade","terrace","parade",
                "drive","dr","park","lane","crescent","court",
                "cres")
  id <- match(thenames, x)
  id[!is.na(id)][1]
}
gives:
> tmpval <- c("38","park","street")
> checkstreet(tmpval)
[1] 3
> tmpval <- c("44","Average","Esplanade")
> checkstreet(tmpval)
[1] 3
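Once the index is known, the street number and name are whatever comes before it, as described in the question. A minimal sketch, assuming the first word is always the street number; splitAddress and its argument names are hypothetical and not part of the code above:

```r
# Hypothetical helper: split a tokenised address at the street-type index
splitAddress <- function(words, typeIdx) {
  if (length(typeIdx) != 1 || is.na(typeIdx) || typeIdx < 2) {
    # No street type found, or nothing before it: flag the record for review
    return(list(number = NA_character_, name = NA_character_, type = NA_character_))
  }
  list(
    number = words[1],
    # Everything between the number and the street type is the street name
    name   = paste(words[seq_len(typeIdx - 1)[-1]], collapse = " "),
    type   = words[typeIdx]
  )
}

# 3 is the index checkstreet() returns for this input
res <- splitAddress(c("38", "park", "street"), 3)
```

Returning NA fields when no street word is found keeps such records easy to filter out later.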
If you insist on using grep and want to keep the \\b word boundaries, you can use the same logic, but this time using which.min:
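checkstreet <- function(x){
  x <- tolower(x)
  thenames <- c("street","st","avenue","ave","av",
                "way","boulevard", "bd", "road", "rd",
                "place", "pl", "esplanade","terrace","parade",
                "drive","dr","park","lane","crescent","court",
                "cres")
  # Wrap each keyword in \\b so that only whole words match
  pats <- paste0("\\b", thenames, "\\b")
  # For each word: rank of the most important keyword it matches (NA if none)
  rank <- vapply(x, function(w) which(vapply(pats, grepl, logical(1), x = w))[1],
                 integer(1))
  # Position of the word with the best (lowest) rank
  unname(which.min(rank))
}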
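Another way to keep grep and still respect the priority order is to try the keywords one at a time and stop at the first hit. This is only a sketch of that idea: firstByPriority is a hypothetical name, and the keyword list is shortened for brevity.

```r
firstByPriority <- function(x, keywords) {
  for (kw in keywords) {
    # Whole-word, case-insensitive match of this keyword against every word in x
    hit <- grep(paste0("\\b", kw, "\\b"), x, ignore.case = TRUE)
    if (length(hit) > 0) return(hit[1])
  }
  NA_integer_  # no keyword matched
}

# Keywords in order of importance (shortened list, for illustration only)
keywords <- c("street", "st", "avenue", "road", "park")
idx <- firstByPriority(c("38", "PARK", "ST"), keywords)
```

Because "st" comes before "park" in keywords, idx points at "ST" rather than "PARK".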
Upvotes: 3
Reputation: 1790
You could do it by matching each of the search words individually in a loop and then scoring the match, giving a higher score to matches that are placed earlier in your search list:
## Vector of search terms:
matchVec <- strsplit("STREET|\\bST\\b|AVENUE|\\bAVE\\b|\\bAV\\b|WAY|BOULEVARD|\\bBD\\b|ROAD|\\bRD\\b|PLACE|\\bPL\\b|ESPLANADE|TERRACE|PARADE|DRIVE|\\bDR\\b|\\bPARK\\b|LANE|CRESCENT|\\bCOURT\\b|\\bCRES\\b", "\\|")[[1]]
## Function to determine score of the match:
scoreMatch <- function(myString, matchVec){
  ## Position of matches in the search list:
  position <- which(vapply(matchVec, function(matchStr) grepl(pattern = matchStr, x = myString),
                           logical(1)))
  ## Score: First search term gets the highest score, second gets second
  ## highest score etc. No match = score 0. If several terms match, the
  ## first (most important) one determines the score:
  if (length(position) > 0) length(matchVec) - position[[1]] + 1 else 0
}
## Determine score of each element/word in your vector:
scoreVec <- vapply(tempval, function(x) scoreMatch(x, matchVec), numeric(1))
## Find index with the highest score:
stID <- which.max(scoreVec)
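One thing to watch with this approach: which.max returns the first index when every score is 0, so a record without any street word would still yield stID = 1. A small sketch of a guard, using made-up scores:

```r
# Hypothetical scores for a record with no street-type word: all zeros
scoreVec <- c("12" = 0, "SOMEWHERE" = 0)

# which.max() still returns the first index for an all-zero vector,
# so check the best score before trusting the result
stID <- if (max(scoreVec) > 0) unname(which.max(scoreVec)) else NA_integer_
```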
Upvotes: 1