Reputation: 2940

R Regex to identify 2 or 3 consecutive capitalized words in string [R]

I'm trying to replicate this answer with an R regex, limiting matches to runs of only 2 or 3 consecutive capitalized words and also accounting for words that are entirely capitalized: Get consecutive capitalized words using regex

The idea is to pull names from other jumbled word garbage:

    test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"

    desired_extract
    [1] Andrew Smith
    [2] Samuel L Jackson
    [3] DEREK JETER
    [4] MIKE NELSON TROUT

Upvotes: 0

Answers (3)

Wiktor Stribiżew

Reputation: 627537

Use a PCRE regex with base R regmatches/gregexpr and the SKIP-FAIL technique: match and skip chunks of 4 or more capitalized words, and keep only the chunks of 2 to 3 capitalized words:

    (*UCP)\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b(*SKIP)(*F)|\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b

See the regex demo

Details

(*UCP) - PCRE verb that makes \b, \s Unicode aware
\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b - a word boundary (\b), an uppercase letter followed with 0+ letters of any case (\p{Lu}\p{L}*, a "capitalized word", which also covers all-caps words), then 3 or more repetitions of 1+ whitespaces (\s+) followed with a capitalized word, then a word boundary
(*SKIP)(*F) - if the match is found with this alternative, discard it and go on to look for another match
| - or
\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b - 2 or 3 whitespace separated capitalized words within word boundaries.

See R demo online:

    test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"
    block <- "\\b\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)"
    regex <- paste0("(*UCP)", block, "{3,}\\b(*SKIP)(*F)|", block, "{1,2}\\b")
    ##regex <- "(*UCP)\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b(*SKIP)(*F)|\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b"
    regmatches(test_string, gregexpr(regex, test_string, perl=TRUE))

Output:

    [[1]]
    [1] "Andrew Smith"      "Samuel L Jackson"  "DEREK JETER"      
    [4] "MIKE NELSON TROUT"
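
To see what the (*SKIP)(*F) branch contributes, here is a minimal sketch that runs the pattern with and without the skip alternative. The sample string `x` and the names `naive` and `skip_fail` are made up for illustration; only `block` comes from the code above.

```r
# Illustrative input (not from the question): a run of four capitalized
# words ("Want Weird Instances Where") followed by a genuine two-word name.
x <- "Don't Want Weird Instances Where more, but Andrew Smith is fine"

# A capitalized word plus a quantifiable group of "whitespace + capitalized word"
block <- "\\b\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)"

# Only the 2-3 word branch, with no skipping of longer runs
naive <- paste0("(*UCP)", block, "{1,2}\\b")
# Full pattern: consume and discard runs of 4+ first, then match 2-3 words
skip_fail <- paste0("(*UCP)", block, "{3,}\\b(*SKIP)(*F)|", block, "{1,2}\\b")

naive_hits <- regmatches(x, gregexpr(naive, x, perl = TRUE))[[1]]
skip_hits  <- regmatches(x, gregexpr(skip_fail, x, perl = TRUE))[[1]]

print(naive_hits)  # "Want Weird Instances" "Andrew Smith"
print(skip_hits)   # "Andrew Smith"
```

Without the first alternative, the engine takes the first three words of the longer run; with it, the whole run is consumed and thrown away before the 2-3 word branch gets a chance.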

Upvotes: 2

Adam Sampson

Reputation: 2021

What makes this a bit difficult is that I couldn't find a way to nest a lookahead inside a group followed by {2,3}. Unfortunately, the best I can do is put this together longhand.

    stringr::str_extract_all(test_string, "(?<!([A-Z][^ ]{0,20} ))([A-Z][^ ,.]*)[ ,.]([A-Z][^ ,.]*)([ ,.]([A-Z][^ ,.]*))?(?=([ ,.]|$))(?!( [A-Z]))")

Results:

    [[1]]
    [1] "Andrew Smith"      "Samuel L Jackson"  "DEREK JETER"       "MIKE NELSON TROUT"

This uses a negative lookbehind, a positive lookahead, and a negative lookahead to check whether the words are preceded or followed by other capitalized words. The explanation is below, with the pattern split into pieces for legibility.

    # Negative lookbehind to make sure there wasn't a word starting with a capital and having up to 20 
    # characters before the first word in our sequence.
    # Note: The lookbehind needs a bounded quantifier such as {0,20} and won't work with * or +
    (?<!([A-Z][^ ]{0,20} ))
    # A word starting with a capital, followed by 0 or more characters that aren't a space, period, 
    # or comma.
    ([A-Z][^ ,.]*)
    # A space, a period, or a comma.
    [ ,.]
    # A word starting with a capital, followed by 0 or more characters that aren't a space, period, or 
    # comma.
    ([A-Z][^ ,.]*)
    # Maybe a third word, indicated by a space/period/comma followed by a word starting with a 
    # capital...
    ([ ,.]([A-Z][^ ,.]*))?
    # Positive lookahead to make sure the last character in the capture is followed by a space, comma, 
    # period, or the end of the string. (Don't cut words in half)
    (?=([ ,.]|$))
    # Negative lookahead to make sure there isn't another word starting with a capital after 
    # our word sequence.
    (?!( [A-Z]))
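
The same pattern can also be assembled from named pieces, which keeps each comment next to the part it describes. This is only a sketch: the variable names are made up for illustration, and the final call runs only if the stringr package is installed.

```r
test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"

# Not preceded by a capitalized word (bounded to 20 characters for the lookbehind)
no_cap_before <- "(?<!([A-Z][^ ]{0,20} ))"
# A word starting with a capital, up to the next space, comma, or period
cap_word <- "([A-Z][^ ,.]*)"
# Separator between words
sep <- "[ ,.]"
# The match must end at a separator or at the end of the string
ends_cleanly <- "(?=([ ,.]|$))"
# Not followed by another capitalized word
no_cap_after <- "(?!( [A-Z]))"

pattern <- paste0(
  no_cap_before,
  cap_word, sep, cap_word,    # two required words
  "(", sep, cap_word, ")?",   # optional third word
  ends_cleanly,
  no_cap_after
)

# stringr is assumed to be installed; skip the call quietly if it is not
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_extract_all(test_string, pattern))
}
```

The assembled `pattern` is character-for-character the same string as the one-liner above, so the results should not differ.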

Upvotes: 2

Philippe Poirier

Reputation: 56

What you are looking for is the {1,2} quantifier instead of +, which limits the number of repetitions.

    ([A-Z]+[a-z]*(?=\s[A-Z])(?:\s[A-Z]+[a-z]*){1,2})

Edit: Updated so it also works on all-caps words.
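
One way to try this pattern from R is base regmatches/gregexpr with perl = TRUE, which the lookahead requires. The sketch below uses the question's test_string; the names `pattern` and `hits` are made up for illustration. Note that the pattern caps each match at three words but does not reject longer runs, so the first three words of "Want Weird Instances Where" come back as well.

```r
test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"

# Backslashes are doubled for the R string literal
pattern <- "([A-Z]+[a-z]*(?=\\s[A-Z])(?:\\s[A-Z]+[a-z]*){1,2})"

# perl = TRUE is needed because the default engine has no lookahead support
hits <- regmatches(test_string, gregexpr(pattern, test_string, perl = TRUE))[[1]]
print(hits)
# "Andrew Smith" "Samuel L Jackson" "Want Weird Instances" "DEREK JETER" "MIKE NELSON TROUT"
```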

Upvotes: 2
