Pablo Boswell

Reputation: 845

R regexpr word word capture

I am trying to match the county name of a state in a string.

strings <- c("High School Graduate or Higher (5-year estimate) in Jefferson Parish, LA"
             ,"High School Graduate or Higher (5-year estimate) in Jefferson Davis Parish, LA")

countyName <- "Jefferson"
stateAbb <- "LA"

test <- gregexpr(paste0(countyName," (\\w), ",stateAbb,"$"),strings,ignore.case=T,perl=T)

I cannot get test to actually return anything.

The code works if I replace \\w with .* but then "Jefferson" will also match lines with "Jefferson Davis".

Of course, when the county name is actually "Jefferson Davis", I want to match "Jefferson Davis".

Upvotes: 1

Views: 65

Answers (1)

Wiktor Stribiżew

Reputation: 627600

Your current regex only matches a single "word" char (that is, a letter, digit or _ symbol) after the countyName, so it cannot match a whole word such as "Parish". To make it match 1 or more "word" chars, add a + quantifier to \w:

test <- gregexpr(paste0(countyName," (\\w+), ",stateAbb,"$"),strings,ignore.case=T,perl=T)
                                         ^

The resulting regex will look like

Jefferson (\w+), LA$

See the regex demo

Details:

  • Jefferson - a literal substring
  • " " - a space
  • (\w+) - a capturing group matching 1 or more letters, digits or _ symbols (if you do not need to access this submatch, remove the ( and ))
  • , - a comma and then a space
  • LA - a literal substring
  • $ - end of string.
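
As a rough check of the breakdown above, the sketch below runs the corrected pattern against the two strings from the question. The helper countyPattern() and the result variable names are hypothetical, introduced only for illustration; grepl(), regexec() and regmatches() are base R. It assumes the county name contains no regex metacharacters.

```r
strings <- c(
  "High School Graduate or Higher (5-year estimate) in Jefferson Parish, LA",
  "High School Graduate or Higher (5-year estimate) in Jefferson Davis Parish, LA"
)
stateAbb <- "LA"

# Hypothetical helper (not part of the question): builds the pattern
# "<county> (\w+), <state>$" for a given county name.
countyPattern <- function(countyName, stateAbb) {
  paste0(countyName, " (\\w+), ", stateAbb, "$")
}

# TRUE only where the county name is followed by exactly one word
# and then ", LA" at the end of the string.
matchJefferson <- grepl(countyPattern("Jefferson", stateAbb), strings,
                        ignore.case = TRUE, perl = TRUE)
matchJeffersonDavis <- grepl(countyPattern("Jefferson Davis", stateAbb), strings,
                             ignore.case = TRUE, perl = TRUE)

# regexec() + regmatches() return the full match followed by each capturing
# group; the second element is the text captured by (\w+).
m <- regmatches(strings,
                regexec(countyPattern("Jefferson", stateAbb), strings,
                        ignore.case = TRUE))
captured <- m[[1]][2]
```

Here matchJefferson should be TRUE only for the first string and matchJeffersonDavis only for the second, and captured should hold "Parish". If only a yes/no answer per string is needed, grepl() alone is enough and the parentheses in the pattern can be dropped.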

Upvotes: 1
