rawr

Reputation: 20811

Start matching from the end of a string

In this question, which was closed, the OP asked how to extract the rank, first, middle, and last names from these strings:

x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook")

#                                  rank             first    middle      last     
# Marshall Robert Forsyth          "Marshall"       "Robert" ""          "Forsyth"
# Deputy Sheriff John A. Gooch     "Deputy Sheriff" "John"   "A."        "Gooch"  
# Constable Darius Quimby          "Constable"      "Darius" ""          "Quimby" 
# High Sheriff John Caldwell Cook  "High Sheriff"   "John"   "Caldwell"  "Cook"

I came up with this, which only works if the middle name includes a period; otherwise, the rank pattern is greedy and captures as much as it can from the beginning of the string.

pat <- '(?i)(?<rank>[a-z ]+)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)'

f <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  n <- attr(m, "capture.names")
  setNames(mapply('substr', x, s, s + l - 1L), n)
}

do.call('rbind', Map(f, x, pat))

#                                 rank                first      middle last     
# Marshall Robert Forsyth         "Marshall"          "Robert"   ""     "Forsyth"
# Deputy Sheriff John A. Gooch    "Deputy Sheriff"    "John"     "A."   "Gooch"  
# Constable Darius Quimby         "Constable"         "Darius"   ""     "Quimby" 
# High Sheriff John Caldwell Cook "High Sheriff John" "Caldwell" ""     "Cook"

So this works if the middle name is either omitted or includes a period:

x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell. Cook")
do.call('rbind', Map(f, x, pat))

So my question is: is there a way to prioritize matching from the end of the string, so that the pattern matches last, middle, and first, then leaves everything else for rank?

Can I do this without reversing the string or something hacky like that? Also, maybe there is a better pattern since I am not great with regex.


Related - [1] [2] - I don't think these will work since another pattern was suggested rather than answering the question. Also, in this example, the number of words in the rank is arbitrary, and the pattern matching the rank would also work for the first name.

Upvotes: 9

Views: 562

Answers (2)

NikitOn

Reputation: 468

We cannot start matching from the end; there is no modifier for that in any regex engine I know of. But we can check how many words remain before the end of the string and restrain our greediness :). This regex does that:

^(?<rank>(?:(?:[ \t]|^)[a-z]+)+?)(?!(?:[ \t][a-z.]+){4,}$)[ \t](?<first>[a-z]+)[ \t](?:(?<middle>[a-z.]+)[ \t])?(?<last>[a-z]+)$

Live preview in regex101.com
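
For anyone who wants to try this in R, here is a minimal sketch. Two things are assumptions on my part and not part of the answer: the `(?i)` flag (the character classes are lower-case only) and the helper name `extract_groups`, which is a vectorised variant of the `f` from the question. The backslashes are doubled because the pattern sits in an R string literal.

```r
# The look-ahead pattern from above, split across lines for readability.
# (?i) is an added assumption; the answer's pattern is otherwise unchanged.
pat_end <- paste0(
  "(?i)^(?<rank>(?:(?:[ \\t]|^)[a-z]+)+?)",  # lazy: as few rank words as possible
  "(?!(?:[ \\t][a-z.]+){4,}$)",              # fail if 4 or more words still follow
  "[ \\t](?<first>[a-z]+)",
  "[ \\t](?:(?<middle>[a-z.]+)[ \\t])?",
  "(?<last>[a-z]+)$"
)

# Hypothetical helper: one row per input string, one column per named group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  # Groups that did not take part in the match come back as "".
  matrix(substring(x, s, s + l - 1L), nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook")
res <- extract_groups(x, pat_end)
print(res)
```

The look-ahead is what stops the lazy rank from ending too early: while four or more words remain, the rank is forced to take another word.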


There is one exception: when the string has only a first and a last name and the rank has more than one word, part of the rank is captured as the first name.


To solve this, you have to define a list of rank prefixes that are always followed by another rank word, and capture them greedily.

E.g.: Deputy, High.
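
A rough sketch of that idea in R, under some loud assumptions: the prefix list `rank_prefixes` is illustrative (only the two words mentioned above), the rank is taken to be any number of those prefixes followed by exactly one more word, and `pat_prefix` and `extract_groups` are names made up here. A rank with two non-prefix words would not match this pattern.

```r
# Illustrative prefix list; a real one would need every multi-word rank prefix.
rank_prefixes <- c("Deputy", "High")

# Rank = zero or more known prefixes (greedy) followed by exactly one more word.
pat_prefix <- paste0(
  "(?i)^(?<rank>(?:(?:", paste(rank_prefixes, collapse = "|"), ")\\s)*[a-z]+)",
  "\\s(?<first>[a-z]+)",
  "\\s(?:(?<middle>[a-z.]+)\\s)?",
  "(?<last>[a-z]+)$"
)

# Hypothetical helper: one row per input string, one column per named group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  matrix(substring(x, s, s + l - 1L), nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

# The last string is the problem case: two-word rank, no middle name.
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook",
       "Deputy Sheriff John Gooch")
res_prefix <- extract_groups(x, pat_prefix)
print(res_prefix)
```

Because the prefix group is greedy, "Deputy Sheriff John Gooch" keeps "Deputy Sheriff" as the rank and leaves the middle name empty.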

Upvotes: 2

Nathan Loyer

Reputation: 344

My R is rusty, but placing a ? after a quantifier makes it non-greedy instead of greedy in all regex engines that I am aware of. So to answer your main question:

Is there a way to prioritize matching from the end of the string such that this pattern matches last, middle, first, then leaving everything else for rank?

You should be able to do this by making the rank match section of the pattern non-greedy by adding a ? after the +.

(?<rank>[a-z ]+?)

Full pattern:

pat <- '(?i)(?<rank>[a-z ]+?)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)'
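
One caveat worth checking before relying on this: the pattern has no end-of-string anchor, so with a lazy rank the match can succeed before reaching the last word. A minimal sketch, assuming it is acceptable to add `^` and `$` (the anchors, `pat_lazy`, and the helper `extract_groups` are additions for illustration, not part of the pattern above):

```r
# Same lazy pattern, anchored at both ends so the match must reach the last word.
pat_lazy <- paste0(
  "(?i)^(?<rank>[a-z ]+?)",          # lazy: give up words to the groups on the right
  "\\s(?<first>[a-z]+)",
  "\\s(?:(?<middle>[a-z.]+)\\s)?",
  "(?<last>[a-z]+)$"
)

# Hypothetical helper: one row per input string, one column per named group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  matrix(substring(x, s, s + l - 1L), nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook")
res_lazy <- extract_groups(x, pat_lazy)
print(res_lazy)
```

Without the anchors, the same pattern applied to "High Sheriff John Caldwell Cook" stops at "Caldwell" and returns rank "High", first "Sheriff", middle "John". Even with them, a two-word rank with no middle name (e.g. "Deputy Sheriff John Gooch") is still read as rank "Deputy", first "Sheriff", middle "John".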

Upvotes: 0
