Reputation: 20811
In a question that has since been closed, the OP asked how to extract rank, first, middle, and last name from these strings:
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
"Constable Darius Quimby", "High Sheriff John Caldwell Cook")
# rank first middle last
# Marshall Robert Forsyth "Marshall" "Robert" "" "Forsyth"
# Deputy Sheriff John A. Gooch "Deputy Sheriff" "John" "A." "Gooch"
# Constable Darius Quimby "Constable" "Darius" "" "Quimby"
# High Sheriff John Caldwell Cook "High Sheriff" "John" "Caldwell" "Cook"
I came up with the following, which only works if the middle name includes a period; otherwise, the pattern for rank captures as much as it can from the beginning of the line.
pat <- '(?i)(?<rank>[a-z ]+)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)'
f <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  n <- attr(m, "capture.names")
  setNames(mapply('substr', x, s, s + l - 1L), n)
}
do.call('rbind', Map(f, x, pat))
# rank first middle last
# Marshall Robert Forsyth "Marshall" "Robert" "" "Forsyth"
# Deputy Sheriff John A. Gooch "Deputy Sheriff" "John" "A." "Gooch"
# Constable Darius Quimby "Constable" "Darius" "" "Quimby"
# High Sheriff John Caldwell Cook "High Sheriff John" "Caldwell" "" "Cook"
So this would work if the middle name was either not given or included a period:
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
"Constable Darius Quimby", "High Sheriff John Caldwell. Cook")
do.call('rbind', Map(f, x, pat))
So my question is: is there a way to prioritize matching from the end of the string, such that this pattern matches last, middle, and first, then leaves everything else for rank?
Can I do this without reversing the string or something hacky like that? Also, maybe there is a better pattern since I am not great with regex.
Related - [1] [2] - I don't think these will work since another pattern was suggested rather than answering the question. Also, in this example, the number of words in the rank is arbitrary, and the pattern matching the rank would also work for the first name.
Upvotes: 9
Views: 562
Reputation: 468
We cannot start matching from the end; there are no modifiers for that in any regex system I know. But we can check how many words remain until the end and restrain our greediness :). The regex below does this.
^(?<rank>(?:(?:[ \t]|^)[a-z]+)+?)(?!(?:[ \t][a-z.]+){4,}$)[ \t](?<first>[a-z]+)[ \t](?:(?<middle>[a-z.]+)[ \t])?(?<last>[a-z]+)$
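To try this pattern from R, it can be passed to `regexpr` with `perl = TRUE`. The following is only a sketch: it assumes a `(?i)` prefix for case-insensitive matching (the pattern as written lists only lowercase letters), and `extract_groups` is a hypothetical helper name, not something from the question.

```r
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook")

# The regex from this answer; the leading (?i) is an assumption added here
# so that [a-z] also matches capital letters.
pat_end <- paste0(
  "(?i)^(?<rank>(?:(?:[ \\t]|^)[a-z]+)+?)",  # lazily collect rank words
  "(?!(?:[ \\t][a-z.]+){4,}$)",              # do not stop while 4+ words remain
  "[ \\t](?<first>[a-z]+)",
  "[ \\t](?:(?<middle>[a-z.]+)[ \\t])?",     # optional middle name
  "(?<last>[a-z]+)$"
)

# Hypothetical helper: character matrix with one row per input string and
# one column per named capture group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  # substring() recycles x down the columns of the start/length matrices;
  # groups that did not take part in the match come back as "".
  vals <- substring(x, s, s + l - 1L)
  matrix(vals, nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

res <- extract_groups(x, pat_end)
print(res)
```

On these four inputs the rank column should come out as "Marshall", "Deputy Sheriff", "Constable", and "High Sheriff".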
When the input has only a first and last name and more than one word for the rank, part of the rank will become the first name.
To solve this, you have to define a list of rank prefixes that are always followed by another word, and capture them greedily.
E.g.: Deputy, High.
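One way to read the prefix idea is to let the rank greedily consume any known prefix words before its final word. The sketch below assumes a two-entry prefix list (Deputy, High); the variable names, the helper `extract_groups`, and the exact pattern are illustrative and not taken from the thread.

```r
# Assumed list of rank words that are always followed by another rank word
prefixes <- c("Deputy", "High")

pat_prefix <- paste0(
  "(?i)^(?<rank>(?:(?:", paste(prefixes, collapse = "|"), ")[ \\t])*[a-z]+)",
  "[ \\t](?<first>[a-z]+)",
  "[ \\t](?:(?<middle>[a-z.]+)[ \\t])?",  # optional middle name
  "(?<last>[a-z]+)$"
)

# Hypothetical helper: character matrix with one row per input string and
# one column per named capture group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  vals <- substring(x, s, s + l - 1L)
  matrix(vals, nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

y <- c("Deputy Sheriff John Gooch",        # two-word rank, no middle name
       "High Sheriff John Caldwell Cook",
       "Marshall Robert Forsyth")
res_prefix <- extract_groups(y, pat_prefix)
print(res_prefix)
```

The trade-off is that the list has to be maintained by hand: a multi-word rank whose leading word is not in `prefixes` would not be split correctly by this sketch.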
Upvotes: 2
Reputation: 344
My R is rusty, but placing a ? after a quantifier makes it non-greedy instead of greedy in all regex engines that I am aware of. So to answer your main question:
Is there a way to prioritize matching from the end of the string such that this pattern matches last, middle, first, then leaving everything else for rank?
You should be able to do this by making the rank match section of the pattern non-greedy by adding a ? after the +.
(?<rank>[a-z ]+?)
Full pattern:
pat <- '(?i)(?<rank>[a-z ]+?)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)'
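A quick check suggests the lazy quantifier needs help from an end anchor: without a trailing `$`, the match can stop as soon as the remaining groups are satisfied, before the end of the string. The sketch below compares the pattern as given with a variant that adds `^` and `$`; the anchors and the helper name `extract_groups` are additions for illustration, not part of the pattern above.

```r
x <- c("Deputy Sheriff John A. Gooch", "High Sheriff John Caldwell Cook")

body <- "(?<rank>[a-z ]+?)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)"
pat_lazy     <- paste0("(?i)", body)        # pattern as given
pat_anchored <- paste0("(?i)^", body, "$")  # assumed variant with anchors

# Hypothetical helper: character matrix with one row per input string and
# one column per named capture group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  vals <- substring(x, s, s + l - 1L)
  matrix(vals, nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

unanchored <- extract_groups(x, pat_lazy)
anchored   <- extract_groups(x, pat_anchored)

print(unanchored)  # first row: "Deputy" "Sheriff" "John" "A"
print(anchored)    # first row: "Deputy Sheriff" "John" "A." "Gooch"
```

Even with the anchors, a two-word rank followed by only a first and last name stays ambiguous: the lazy rank stops after its first word, so the second rank word ends up as the first name.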
Upvotes: 0